<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <!--[if IE]><meta http-equiv="X-UA-Compatible" content="IE=edge"><![endif]--> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <meta name="generator" content="Asciidoctor 1.5.5"> <meta name="description" content="Capture Long-Seq (CLS) Companion Website"> <meta name="keywords" content="CLS, RNA capture, captureseq lncRNA, long non-coding RNA, GENCODE, mouse, human, PacBio"> <title>Capture Long-Seq (CLS) on Human and Mouse lincRNAs</title> <link rel="stylesheet" href="./css/jul.css"> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.6.3/css/font-awesome.min.css"> </head> <body class="article toc2 toc-left"> <div id="header"> <h1>Capture Long-Seq (CLS) on Human and Mouse lincRNAs <span class="image" style="float: center"><img src="./img/capture_horizontal_flowchart.png" alt="capture horizontal flowchart"></span></h1> <div id="toc" class="toc2"> <div id="toctitle">Contents</div> <ul class="sectlevel1"> <li><a href="#_summary">Summary</a></li> <li><a href="#_supplementary_data">Supplementary Data</a> <ul class="sectlevel2"> <li><a href="#_track_hub">Track Hub</a></li> <li><a href="#_merged_transcript_models">Merged Transcript Models</a></li> <li><a href="#__gencode_lncrna_annotations">"GENCODE+" lncRNA annotations</a></li> <li><a href="#_read_alignments">Read alignments</a></li> <li><a href="#_transcription_start_sites">Transcription Start Sites</a></li> <li><a href="#_polya_reads_sites_and_signals">PolyA reads, sites and signals</a></li> <li><a href="#_splice_junctions">Splice junctions</a></li> <li><a href="#_capture_probe_coordinates">Capture probe coordinates</a></li> </ul> </li> <li><a href="#_software">Software</a></li> <li><a href="#_contact">Contact</a></li> </ul> </div> </div> <div id="content"> <div id="preamble"> <div class="sectionbody"> <div class="paragraph"> <p>Welcome to the companion website of our GENCODE lincRNA Capture Long-Seq study 
(<a href="http://dx.doi.org/10.1038/ng.3988" target="_blank">Lagarde <em>et al.</em></a>, <strong>"High-throughput annotation of full-length long non-coding RNAs with Capture Long-Read Sequencing"</strong>, <em>Nature Genetics</em>, 2017).</p> </div> <div class="paragraph"> <p>The aim of this site is to enable easy navigation and download of the various datasets and software related to this project.</p> </div> </div> </div> <div class="sect1"> <h2 id="_summary">Summary</h2> <div class="sectionbody"> <div class="paragraph"> <p>Accurate annotation of genes and their transcripts is a foundation of genomics, but no annotation technique presently combines throughput and accuracy. As a result, reference gene collections remain incomplete: many gene models are fragmentary, while thousands more remain uncatalogued, particularly for long noncoding RNAs (lncRNAs). To accelerate lncRNA annotation, the <a href="http://www.gencodegenes.org/" target="_blank">GENCODE consortium</a> has developed RNA Capture Long Seq (CLS), combining targeted RNA capture with third-generation long-read sequencing (PacBio). We present an experimental re-annotation of the GENCODE intergenic lncRNA population in matched human and mouse tissues, resulting in novel transcript models for 3,574 / 561 gene loci, respectively. CLS approximately doubles the annotated complexity of targeted loci, outperforming existing short-read techniques. Full-length transcript models produced by CLS enable us to definitively characterize the genomic features of lncRNAs, including promoter- and gene-structure, and protein-coding potential. Thus CLS removes a longstanding bottleneck of transcriptome annotation, generating manual-quality full-length transcript models at high-throughput scales.</p> </div> </div> </div> <div class="sect1"> <h2 id="_supplementary_data">Supplementary Data</h2> <div class="sectionbody"> <div class="paragraph"> <p>All genome-mapped data are based on versions <em>GRCh38</em> (a.k.a.
<em>hg38</em>, human) and <em>GRCm38</em> (a.k.a. <em>mm10</em>, mouse).</p> </div> <div class="paragraph"> <p>In addition to the files mentioned in this section, raw (FASTQ files) and processed files (BED files of capture probes and merged transcript models) are available from the GEO database under accession <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE93848" target="_blank">GSE93848</a>.</p> </div> <div class="sect2"> <h3 id="_track_hub">Track Hub</h3> <div class="paragraph"> <p>The GENCODE CLS Track Hub can be loaded directly into the UCSC Genome Browser using these links:</p> </div> <div class="ulist"> <ul> <li> <p><a href="http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&hubUrl=http://public-docs.crg.es/rguigo/CLS/data/trackHub/hub.txt" target="_blank">Human</a> (genome assembly: hg38)</p> </li> <li> <p><a href="http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=mm10&hubUrl=http://public-docs.crg.es/rguigo/CLS/data/trackHub/hub.txt" target="_blank">Mouse</a> (genome assembly: mm10)</p> </li> </ul> </div> </div> <div class="sect2"> <h3 id="_merged_transcript_models">Merged Transcript Models</h3> <div class="paragraph"> <p>Transcript models were built from <span class="tooltip">HCGMs<span class="tooltiptext"><u>H</u>igh-<u>C</u>onfidence <u>G</u>enome <u>M</u>appings are PacBio read alignments with the following characteristics:<br>- <i>Uniquely mapped</i><br>- <i>Double-bounded</i>: the read contains both library adapters (<i>i.e.</i>, its cDNA insert is believed to be fully sequenced)<br>- If spliced, its constituting introns must all be canonical (GT|GC/AG)<br>- If unspliced, it must bear a detectable polyA tail</span></span> using two distinct approaches, <strong>"anchored"</strong> and <strong>"standard/non-anchored"</strong>. 
Each method has its own pros and cons:</p> </div> <div class="ulist"> <ul> <li> <p>The <strong>"anchored"</strong> procedure prevents the transcripts whose end(s) are supported by CAGE or polyA data from being merged into a longer, compatible transcript "container". This preserves all supported transcript ends in the output, including internal sites. It’s also the set analyzed in the paper. On the other hand, many distinct models of this set differ only by their TSS/TTS, <em>i.e.</em>, it contains many redundant intron chains.</p> </li> <li> <p>The <strong>"standard/non-anchored"</strong> approach produces merged transcript models the same way GENCODE, or standard merging software (<em>e.g. Cuffmerge</em>), do. This means that all transcripts with compatible intron chains are considered redundant, and thus merged into a single "container", regardless of their end support. Its main disadvantage is that it discards many alternative, internal TSSs/TTSs in the process. On the plus side, this procedure guarantees that intron chains are totally non-redundant in the merged set.</p> </li> </ul> </div> <div class="sect3"> <h4 id="__anchored_method">"Anchored" method</h4> <div class="ulist"> <ul> <li> <p>GTF files are listed <a href="./data/mergedTMs/anchored/gtf/" target="_blank">here</a>.</p> </li> <li> <p>Equivalent BED files are available from <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE93848">GEO</a>.</p> </li> <li> <p>Transcript-to-biotype assignments are listed <a href="./data/mergedTMs/anchored/TMbiotypes/" target="_blank">here</a>. 
These files also contain the identifiers of the genomic features overlapped (mostly GENCODE gene_ids) by each transcript structure.</p> </li> <li> <p>Transcripts with <strong>HiSeq-supported</strong> splice junctions are listed <a href="./data/mergedTMs/anchored/HiSeq-supported/" target="_blank">here</a> (this set also contains mono-exonic transcript models).</p> </li> </ul> </div> </div> <div class="sect3"> <h4 id="_standard_method_non_anchored">Standard method (non-anchored)</h4> <div class="ulist"> <ul> <li> <p>GTF files are listed <a href="./data/mergedTMs/non-anchored/gtf/" target="_blank">here</a>.</p> </li> <li> <p>Transcript-to-biotype assignments are listed <a href="./data/mergedTMs/non-anchored/TMbiotypes/" target="_blank">here</a>. These files also contain the identifiers of the genomic features overlapped (mostly GENCODE gene_ids) by each transcript structure.</p> </li> <li> <p>Transcripts with <strong>HiSeq-supported</strong> splice junctions are listed <a href="./data/mergedTMs/non-anchored/HiSeq-supported/" target="_blank">here</a> (this set also contains mono-exonic transcript models).</p> </li> </ul> </div> </div> </div> <div class="sect2"> <h3 id="__gencode_lncrna_annotations">"GENCODE+" lncRNA annotations</h3> <div class="paragraph"> <p><a href="./data/gencodePlus/" target="_blank">GENCODE+</a> is the union of GENCODE and CLS lncRNAs.</p> </div> </div> <div class="sect2"> <h3 id="_read_alignments">Read alignments</h3> <div class="paragraph"> <p>Post-capture only.</p> </div> <div class="sect3"> <h4 id="_raw_bam_files">Raw BAM files</h4> <div class="paragraph"> <p>Listed in <a href="./data/BAMs/" target="_blank">this directory</a>. These files correspond to the ones loaded into the project’s Track Hub.</p> </div> <div class="paragraph"> <p><strong>Warning</strong>: these alignments are <strong>unstranded</strong>.
See below for a stranded version.</p> </div> </div> <div class="sect3"> <h4 id="_processed_pacbio_read_alignments_span_class_tooltip_hcgms_span_class_tooltiptext_u_h_u_igh_u_c_u_onfidence_u_g_u_enome_u_m_u_appings_are_pacbio_read_alignments_with_the_following_characteristics_br_i_uniquely_mapped_i_br_i_double_bounded_i_the_read_contains_both_library_adapters_i_i_e_i_its_cdna_insert_is_believed_to_be_fully_sequenced_br_if_spliced_its_constituting_introns_must_all_be_canonical_gt_gc_ag_br_if_unspliced_it_must_bear_a_detectable_polya_tail_span_span">Processed PacBio read alignments (<span class="tooltip">HCGMs<span class="tooltiptext"><u>H</u>igh-<u>C</u>onfidence <u>G</u>enome <u>M</u>appings are PacBio read alignments with the following characteristics:<br>- <i>Uniquely mapped</i><br>- <i>Double-bounded</i>: the read contains both library adapters (<i>i.e.</i>, its cDNA insert is believed to be fully sequenced)<br>- If spliced, its constituting introns must all be canonical (GT|GC/AG)<br>- If unspliced, it must bear a detectable polyA tail</span></span>)</h4> <div class="paragraph"> <p>Listed in <a href="./data/readGTFs/" target="_blank">this directory</a> (in GTF format).</p> </div> <div class="paragraph"> <p>Each read alignment is stranded according to its intron motifs and polyA tail location (see Methods section of the paper for more details).</p> </div> </div> <div class="sect3"> <h4 id="_read_biotypes">Read biotypes</h4> <div class="paragraph"> <p>PacBio read-to-biotype assignments are listed <a href="./data/readBiotypes/" target="_blank">here</a> (<span class="tooltip">HCGMs<span class="tooltiptext"><u>H</u>igh-<u>C</u>onfidence <u>G</u>enome <u>M</u>appings are PacBio read alignments with the following characteristics:<br>- <i>Uniquely mapped</i><br>- <i>Double-bounded</i>: the read contains both library adapters (<i>i.e.</i>, its cDNA insert is believed to be fully sequenced)<br>- If spliced, its constituting introns must all be canonical (GT|GC/AG)<br>- If 
unspliced, it must bear a detectable polyA tail</span></span> only). These files also contain the identifiers of the genomic features overlapped (mostly GENCODE gene_ids) by each read.</p> </div> </div> </div> <div class="sect2"> <h3 id="_transcription_start_sites">Transcription Start Sites</h3> <div class="paragraph"> <p>Transcription Start Sites (TSSs) derived from <span class="tooltip">HCGMs<span class="tooltiptext"><u>H</u>igh-<u>C</u>onfidence <u>G</u>enome <u>M</u>appings are PacBio read alignments with the following characteristics:<br>- <i>Uniquely mapped</i><br>- <i>Double-bounded</i>: the read contains both library adapters (<i>i.e.</i>, its cDNA insert is believed to be fully sequenced)<br>- If spliced, its constituting introns must all be canonical (GT|GC/AG)<br>- If unspliced, it must bear a detectable polyA tail</span></span> are available in two flavors:</p> </div> <div class="sect3"> <h4 id="_raw_tsss">Raw TSSs</h4> <div class="paragraph"> <p><a href="./data/TSSs/raw/" target="_blank">These BED files</a> contain the coordinates of all raw TSSs called using <span class="tooltip">HCGMs<span class="tooltiptext"><u>H</u>igh-<u>C</u>onfidence <u>G</u>enome <u>M</u>appings are PacBio read alignments with the following characteristics:<br>- <i>Uniquely mapped</i><br>- <i>Double-bounded</i>: the read contains both library adapters (<i>i.e.</i>, its cDNA insert is believed to be fully sequenced)<br>- If spliced, its constituting introns must all be canonical (GT|GC/AG)<br>- If unspliced, it must bear a detectable polyA tail</span></span>. The corresponding PacBio reads are also identified.</p> </div> </div> <div class="sect3"> <h4 id="_clustered_tsss">Clustered TSSs</h4> <div class="paragraph"> <p><a href="./data/TSSs/clustered/" target="_blank">These BED files</a> contain the coordinates of all merged TSSs (one record per non-redundant TSS). 
The PacBio reads contributing to each site are also identified.</p> </div> <div class="paragraph"> <p>Clustered TSSs split by known / novel status (with respect to GENCODE) are available in <a href="./data/TSSs/clustered/knownVsNovelTSSs/" target="_blank">this directory</a>.</p> </div> </div> </div> <div class="sect2"> <h3 id="_polya_reads_sites_and_signals">PolyA reads, sites and signals</h3> <div class="sect3"> <h4 id="_raw_sites">Raw sites</h4> <div class="paragraph"> <p><a href="./data/polyA/raw/" target="_blank">These BED files</a> contain the coordinates of all raw polyA sites called using captured PacBio reads. The corresponding poly-adenylated PacBio reads are also identified.</p> </div> </div> <div class="sect3"> <h4 id="_clustered_sites">Clustered sites</h4> <div class="paragraph"> <p><a href="./data/polyA/clustered/" target="_blank">These BED files</a> contain the coordinates of all merged polyA sites (one record per non-redundant polyA site). The poly-adenylated PacBio reads contributing to each site are also identified.</p> </div> </div> <div class="sect3"> <h4 id="_polya_signals">PolyA signals</h4> <div class="paragraph"> <p>The sequences and genome coordinates of putative polyA signals used in the study are available <a href="./data/polyA/signals/" target="_blank">here</a>.</p> </div> </div> </div> <div class="sect2"> <h3 id="_splice_junctions">Splice junctions</h3> <div class="paragraph"> <p>PacBio canonical Splice Junctions (SJs) are available <a href="./data/introns/" target="_blank">here</a> (in GTF format). 
They were derived from the PacBio read mappings available <a href="./data/readGTFs/incl_non-DB-HCGMs/" target="_blank">here</a>.</p> </div> <div class="sect3"> <h4 id="_analysis_of_splicing_motifs">Analysis of splicing motifs</h4> <div class="paragraph"> <p>We ran <a href="http://genome.crg.es/software/geneid/" target="_blank">geneid</a> to score various sets of splice sites using <a href="./data/human.param.Feb_22_2006_GC" target="_blank">this parameter file</a>. The resulting sets are linked below:</p> </div> <div class="ulist"> <ul> <li> <p>All putative <code>GC|GT</code> donors + <code>AG</code> acceptors (whole genome): <a href="./data/introns/geneid/scores/whole-genome" target="_blank">link</a>.</p> </li> <li> <p>All control splice sites (in captured regions, with no evidence of splicing): <a href="./data/introns/geneid/scores/controls" target="_blank">link</a>.</p> </li> </ul> </div> </div> </div> <div class="sect2"> <h3 id="_capture_probe_coordinates">Capture probe coordinates</h3> <div class="paragraph"> <p>Provided in BED format:</p> </div> <div class="ulist"> <ul> <li> <p><a href="./data/trackHub/dataFiles/hs_Cap1_probes.bed" target="_blank">Human</a></p> </li> <li> <p><a href="./data/trackHub/dataFiles/mm_Cap1_probes.bed" target="_blank">Mouse</a></p> </li> </ul> </div> </div> </div> </div> <div class="sect1"> <h2 id="_software">Software</h2> <div class="sectionbody"> <div class="ulist"> <ul> <li> <p><a href="https://github.com/julienlag/samToPolyA" target="_blank"><code>samToPolyA</code></a>: Calls poly-adenylated reads and polyA sites from read alignments.</p> </li> <li> <p><a href="https://github.com/julienlag/matchDistribution" target="_blank"><code>matchDistribution</code></a>: Given distinct "subject" (S) and "target" (T) distributions, attempts to mimic T’s density (i.e., its shape) by sampling from S’s population.
Used in this project to expression-match sets of protein-coding TSSs to lncRNA ones.</p> </li> <li> <p><a href="https://github.com/julienlag/anchorTranscriptsEnds" target="_blank"><code>anchorTranscriptsEnds</code></a>: prepare a GTF file of mapped transcriptome reads for anchored transcript merging.</p> </li> <li> <p><a href="https://github.com/sdjebali/Compmerge" target="_blank"><code>compmerge</code></a>: used to merge the <span class="tooltip">HCGMs<span class="tooltiptext"><u>H</u>igh-<u>C</u>onfidence <u>G</u>enome <u>M</u>appings are PacBio read alignments with the following characteristics:<br>- <i>Uniquely mapped</i><br>- <i>Double-bounded</i>: the read contains both library adapters (<i>i.e.</i>, its cDNA insert is believed to be fully sequenced)<br>- If spliced, its constituting introns must all be canonical (GT|GC/AG)<br>- If unspliced, it must bear a detectable polyA tail</span></span> into a non-redundant set of transcript models.</p> </li> <li> <p><a href="https://github.com/pervouchine/ipsa-full" target="_blank"><code>IPSA</code></a> (Integrative Pipeline for Splicing Analyses), used to extract splice junctions and their read counts from alignments of Illumina HiSeq data.</p> </li> <li> <p><a href="https://github.com/sdjebali/Comptr" target="_blank"><code>comptr</code></a>: compare a set of transcripts to a reference annotation.</p> </li> </ul> </div> </div> </div> <div class="sect1"> <h2 id="_contact">Contact</h2> <div class="sectionbody"> <div class="paragraph"> <p>Questions, requests and comments about this study should be <a href="mailto:julien.lagarde%40crg.cat,barbara.uszczynska%40gmail.com,roderic.guigo%40crg.cat,rory.johnson%40dkf.unibe.ch?subject=CLS">sent to</a>:</p> </div> <div class="ulist"> <ul> <li> <p><strong>Julien Lagarde</strong> (<a href="http://www.crg.eu/en/roderic_guigo" target="_blank">CRG</a>, Barcelona, Spain)</p> </li> <li> <p><strong>Barbara Uszczynska-Ratajczak</strong> (<a href="http://www.crg.eu/en/roderic_guigo" 
target="_blank">CRG</a>, Barcelona, Spain)</p> </li> <li> <p><strong>Roderic Guigó</strong> (<a href="http://www.crg.eu/en/roderic_guigo" target="_blank">CRG</a>, Barcelona, Spain)</p> </li> <li> <p><strong>Rory Johnson</strong> (<a href="https://gold-lab.org/" target="_blank">GOLD Lab</a>, University of Bern, Switzerland)</p> </li> </ul> </div> <hr> <table class="tableblock frame-none grid-all spread"> <colgroup> <col style="width: 25%;"> <col style="width: 25%;"> <col style="width: 25%;"> <col style="width: 25%;"> </colgroup> <tbody> <tr> <td class="tableblock halign-center valign-middle"><p class="tableblock"><span class="image"><a class="image" href="http://www.crg.eu/en/roderic_guigo"><img src="./img/crg_blue_logo.png" alt="CRG logo" width="160"></a></span></p></td> <td class="tableblock halign-center valign-middle"><p class="tableblock"><span class="image"><a class="image" href="http://www.sanger.ac.uk/science/groups/vertebrate-annotation"><img src="./img/wtsi_logo.png" alt="Sanger logo" width="160"></a></span></p></td> <td class="tableblock halign-center valign-middle"><p class="tableblock"><span class="image"><a class="image" href="http://www.qgenomics.com"><img src="./img/qgenomics_logo.png" alt="Qgenomics logo" width="160"></a></span></p></td> <td class="tableblock halign-center valign-middle"><p class="tableblock"><span class="image"><a class="image" href="https://www.cshl.edu/Faculty/Thomas-Gingeras.html"><img src="./img/cshl_logo.jpeg" alt="CSHL logo" width="160"></a></span></p></td> </tr> </tbody> </table> </div> </div> </div> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.9.1/styles/github.min.css"> <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.9.1/highlight.min.js"></script> <script>hljs.initHighlighting()</script> </body> </html>