Merging CAGE experiments

This page describes the problem of merging independent CAGE-seq experiments and approaches to solving it.

Problem statement

Transcription of genes begins at genomic positions called transcription start sites (TSSs). CAGE is a high-throughput transcriptome analysis technique that identifies active TSSs at single-base resolution and measures their relative activities. CAGE studies have shown that different sets of TSSs can operate under different conditions, and that transcription can start from several closely spaced TSSs within a promoter. This complicates the comparative analysis of CAGE experiments carried out under different conditions. We have developed a method that combines independent CAGE experiments into a pooled set of TSSs with accurately defined boundaries. Applying this method iteratively to a large set of CAGE experiments allows the construction of a reference TSS set. Such a reference set makes it easy to compare TSS activities across experiments and to identify previously unknown TSSs in incoming data.

Algorithm overview

The method accepts two data sets (Reference and NewData) as input. Each set consists of CAGE peaks and the corresponding genome-wide profile of the 5' ends of CAGE reads. The result is a set of non-overlapping NewReference peaks that reflects all TSSs from the input sets. Intersecting the two sets of CAGE peaks (Reference and NewData) by genomic coordinates yields the following types of peaks (Fig. 1):

  1. Previously unknown - NewData peaks that do not intersect any Reference peak.
  2. Not active in NewData - Reference peaks that do not intersect any NewData peak.
  3. Previously known, active in NewData - Reference and NewData peaks that intersect each other.
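
The three peak types above can be illustrated with a minimal Python sketch. It assumes that peaks lie on a single chromosome and strand and are given as half-open (start, end) intervals. The names `overlaps` and `classify_peaks` are hypothetical and are not taken from the original implementation, and the quadratic scan is chosen for clarity rather than speed.

```python
from typing import List, Tuple

# Assumption: a peak is a half-open interval [start, end) on one chromosome and strand.
Peak = Tuple[int, int]


def overlaps(a: Peak, b: Peak) -> bool:
    """Return True if the two intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def classify_peaks(reference: List[Peak], new_data: List[Peak]):
    """Split peaks into the three types obtained by intersecting the two sets."""
    # Type 1: NewData peaks that intersect no Reference peak.
    previously_unknown = [
        n for n in new_data if not any(overlaps(n, r) for r in reference)
    ]
    # Type 2: Reference peaks that intersect no NewData peak.
    not_active = [
        r for r in reference if not any(overlaps(r, n) for n in new_data)
    ]
    # Type 3: every intersecting (Reference, NewData) pair.
    known_active = [
        (r, n) for r in reference for n in new_data if overlaps(r, n)
    ]
    return previously_unknown, not_active, known_active


reference = [(100, 150), (300, 340)]
new_data = [(120, 170), (500, 520)]
unknown, inactive, active = classify_peaks(reference, new_data)
print(unknown)   # NewData peaks with no Reference counterpart
print(inactive)  # Reference peaks with no NewData counterpart
print(active)    # intersecting pairs whose boundaries need refining
```

A peak that takes part in several intersecting pairs appears once per pair in the third list, which is how multiple intersections show up in this sketch.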


Figure 1



Peaks of the first two types go into the NewReference set unchanged. For the third type, the boundaries must be refined, because the intersection can be partial. A partial intersection of Reference and NewData peaks can produce overhanging ends (Fig. 2) as well as multiple intersections (Fig. 3, Fig. 4).

Figure 2. CAGE peak overhangs.



Figure 3. One reference peak to several new peaks.



Figure 4. One new peak to several reference peaks.



A preliminary analysis of the rat CAGE data (Reference = FANTOM5, NewData = UEXP) showed substantial overhangs (the average overhang length is 55% of the peak length for FANTOM5 peaks and 12% for UEXP peaks) and a substantial proportion of multiple intersections (7.5% of FANTOM5 peaks intersect more than one UEXP peak, and 1.1% of UEXP peaks intersect more than one FANTOM5 peak). Overhangs and multiple intersections can arise either from genuine TSS activity in these regions in only one of the data sets, or from the peak caller placing boundaries imprecisely because of insufficient read coverage. It is important to distinguish between these cases: in the first case the region should be included in the NewReference, and in the second it should not. This requires an analysis of the read density profile in these regions. To this end, the entire genome is divided into segments defined by the peak boundaries of both data sets (Fig. 5).


Figure 5. Segmentation of CAGE peaks


Segments can be of four types:

  1. Confirmed in Reference and NewData (+/+).
  2. Confirmed only in Reference (+/-).
  3. Confirmed only in NewData (-/+).
  4. Not confirmed in either data set (-/-).
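
The segmentation and the four segment types can be sketched as follows. This is a minimal illustration, not the original implementation. It assumes half-open (start, end) peaks on one chromosome and strand, and that peaks do not overlap within a data set. The function names and the "-/-" label for unconfirmed segments are hypothetical.

```python
from typing import List, Tuple

# Assumption: a peak is a half-open interval [start, end) on one chromosome and strand.
Peak = Tuple[int, int]


def covered(start: int, end: int, peaks: List[Peak]) -> bool:
    """Return True if [start, end) lies entirely inside one of the peaks."""
    return any(s <= start and end <= e for s, e in peaks)


def segment_peaks(reference: List[Peak], new_data: List[Peak]):
    """Cut the region spanned by all peaks at every peak boundary and label
    each segment by its support in Reference / NewData."""
    boundaries = sorted(
        {pos for peak in list(reference) + list(new_data) for pos in peak}
    )
    segments = []
    # Consecutive boundaries delimit segments that are either fully inside
    # or fully outside every peak, so a containment check is sufficient.
    for start, end in zip(boundaries, boundaries[1:]):
        in_ref = "+" if covered(start, end, reference) else "-"
        in_new = "+" if covered(start, end, new_data) else "-"
        segments.append((start, end, f"{in_ref}/{in_new}"))
    return segments


reference = [(100, 150), (300, 340)]
new_data = [(120, 170)]
for seg in segment_peaks(reference, new_data):
    print(seg)
```

In this example the Reference overhang (100, 120) is labelled +/-, the shared core (120, 150) is +/+, the NewData overhang (150, 170) is -/+, and the gap between peaks is -/-. Genomic regions before the first boundary and after the last one are unconfirmed and are simply not emitted here.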

Segments of the first type always go into the NewReference. Segments of types 2 and 3 are included only after analyzing the profile of the 5' ends of the CAGE reads in these regions: for segments of the second type (+/-) the Reference profile is analyzed, and for segments of the third type (-/+) the NewData profile. A segment of type 2 or 3 is discarded if its read density is significantly lower than that of the adjacent +/+ segment, if there is one (Fig. 6).


Figure 6.



Read densities in two segments are compared with a binomial test, where p is the ratio of the length of the first segment to the sum of the lengths of both segments, n is the total number of reads in both segments, and k is the number of reads in the first segment. Here the first segment is taken to be the type 2 or type 3 segment under test, and the second the adjacent +/+ segment. If the p-value is <= 0.01, the density is considered significantly lower and the segment is discarded. At the last stage, adjacent segments that are to be included in the NewReference are merged into one.
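
The density test and the final merging step can be sketched in Python using only the standard library. The sketch rests on several assumptions not stated in the text: the test is one-sided (lower tail, P(X <= k) for X ~ Binomial(n, p)), the "first segment" is the +/- or -/+ segment under test, and segments are half-open intervals. All function names are hypothetical.

```python
from math import exp, lgamma, log
from typing import List, Tuple


def binom_lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space
    so that large read counts do not overflow."""
    def log_pmf(i: int) -> float:
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log(1.0 - p))
    return min(1.0, sum(exp(log_pmf(i)) for i in range(k + 1)))


def segment_is_discarded(seg_len: int, seg_reads: int,
                         adj_len: int, adj_reads: int,
                         alpha: float = 0.01) -> bool:
    """Apply the test described above to one candidate segment and the
    adjacent +/+ segment (assumed one-sided, lower tail)."""
    p = seg_len / (seg_len + adj_len)   # expected share of reads by length
    n = seg_reads + adj_reads           # reads in both segments
    k = seg_reads                       # reads in the candidate segment
    return binom_lower_tail(k, n, p) <= alpha


def merge_adjacent(segments: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Join kept segments that touch end-to-start into single peaks."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(segments):
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


# A 20 bp overhang with 2 reads next to a 30 bp core with 100 reads: sparse.
print(segment_is_discarded(20, 2, 30, 100))
# The same overhang with 35 reads next to a core with 65 reads: comparable.
print(segment_is_discarded(20, 35, 30, 65))
print(merge_adjacent([(100, 120), (120, 150), (200, 210)]))
```

If SciPy is available, `scipy.stats.binomtest(k, n, p, alternative="less").pvalue` should give the same lower-tail p-value. Whether the original method uses a one-sided or a two-sided test is not stated, so the choice here is only an assumption.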

Results on rn6 and hg38 data

This method was applied to combine the FANTOM5 and UEXP data for Rattus norvegicus (rn6), and to combine the HumanMuscle and RefTSS data for Homo sapiens (hg38). In both cases it identified new TSSs and refined the boundaries of previously known TSSs. Detailed statistics are given in Tables 1 and 2.


Table 1. Rattus norvegicus CAGE data set statistics.

Data set        Number of peaks   Bases count
FANTOM5(rn6)    28497             458 kb
UEXP            34141             482 kb
Combined        48859             644 kb

Summary of changes made to FANTOM5(rn6) when combining with UEXP data:

  • Changed peaks: 8058
  • Deleted peaks: 0
  • New peaks: 19017
  • Split peaks: 2805
  • Joined peaks: 135
  • Unchanged peaks: 18844



Table 2. Homo sapiens CAGE data set statistics.

Data set        Number of peaks   Bases count
REFTSS(hg38)    224694            4315 kb
HumanMuscle     41767             536 kb
Combined        242109            3987 kb

Summary of changes made to REFTSS(hg38) when combining with the HumanMuscle data:

  • Changed peaks: 14061
  • Deleted peaks: 0
  • New peaks: 10407
  • Split peaks: 13109
  • Joined peaks: 365
  • Unchanged peaks: 204167