About the experiment

The experiment we’ll work through is an ATAC-cap-seq experiment on Arabidopsis leaves, taken from plants given either a mock (water) treatment or infection with a pathogen. We have the bare minimum number of replicates: three independent samples for each of the mock and infected treatments. The reads are paired-end and 50 nt long. The BAM files were produced by aligning with BWA and sorting the resulting SAM files with samtools; at the start no further processing has been done. The paths to the BAM files and the bait region coordinates on your system are described above.

Preparing input files

The first step of any atacR analysis is to build the input files: these simply list where the important data files are on your system. You’ll need a file listing the BAM files and a file listing the bait windows. Both are described more fully in the loading vignette.

The atacR package comes with a small set of ATAC-cap-seq data built in. It is installed along with the package, and we’ll use that data in this tutorial.

A convenience function is built into atacR that will find the built-in data and create the files you need to follow this tutorial. All you need to do is decide where those files should be written. Below we write them to the Desktop, where two files will appear - bait_regions.gff and sample_treatment_bam_mappings.csv. If you inspect these you’ll see their structure.

library(atacr)
input_files <- make_tutorial_data("~/Desktop/")
input_files
## $bait_regions_file
## [1] "~/Desktop//bait_regions.gff"
## 
## $mapping_file
## [1] "~/Desktop//sample_treatment_bam_mappings.csv"

The object input_files holds the paths to the input files we made, so we’ll use that to get going.
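
If you want to check what was written before going further, both files are plain text and can be read with base R. This is just an optional sanity check; the exact columns and attributes are whatever make_tutorial_data() produces, not anything assumed here.

# Optional: peek at the generated input files (plain CSV and GFF text)
read.csv(input_files$mapping_file)               # assumes the mapping CSV has a header row
readLines(input_files$bait_regions_file, n = 5)  # first few GFF lines of bait coordinates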

Generating read counts

Once we have the files ready, we can begin analysis. We can extract information from the files and make the counts we’re interested in with the make_counts() function. In this step we’ll set the read filter parameters to decide which reads and alignments in the BAM file are of sufficiently good quality to be counted.

Depending on whether you have ATAC-cap-seq or RNA-cap-seq this function does slightly different things. If you have ATAC-cap-seq, this function divides the bait regions in the genome into sub-windows of a fixed width. If you have RNA-cap-seq the whole bait region is considered to be a single window.

Below we will use the default window size (50 nt, non-overlapping) and read filters (described in the loading vignette). As this is ATAC-cap-seq data we need to specify that too.

counts <- make_counts(input_files$bait_regions_file,
                      input_files$mapping_file,
                      is_rnaseq = FALSE)

The resulting counts object has a few slots containing information. The most important are bait_windows, which describes the windows in the bait regions, and non_bait_windows, which describes all the spaces in between the bait windows. By default all functions work on bait_windows, but you can change the subset using the which parameter (see the atacr which vignette for more information).
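
As the later sections of this tutorial show, counts$bait_windows is a SummarizedExperiment, so you can already peek at the raw window-by-sample count matrix with the assay() accessor. A minimal sketch:

# Dimensions and first few rows of the bait-window count matrix
dim(SummarizedExperiment::assay(counts$bait_windows))
head(SummarizedExperiment::assay(counts$bait_windows))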

Summarising data

Once everything is loaded, it is a good idea to check the counts object is as you expect. The summary() function does this.

Summary statistics

summary(counts)
## ATAC-seq experiment of 2 treatments in 6 samples
##  Treatments: mock,infected 
##  Samples: mock_rep1,mock_rep2,mock_rep3,infected_rep1,infected_rep2,infected_rep3 
##  Bait regions used: 2219 
##  Total Windows: 2382930 
##  
##  On/Off target read counts:
##           sample off_target on_target percent_on_target
## 1 infected_rep1          0     60730         100.00000
## 2 infected_rep2          0     33375         100.00000
## 3 infected_rep3          0     55221         100.00000
## 4     mock_rep1          0     29925         100.00000
## 5     mock_rep2          5     52883          99.99055
## 6     mock_rep3          4     34373          99.98836 
##  Quantiles: 
##  $bait_windows
##     mock_rep1 mock_rep2 mock_rep3 infected_rep1 infected_rep2
## 1%       0.00       0.0         0          0.00          0.00
## 5%       0.00       0.0         0          0.00          0.00
## 95%     33.00      29.3        39          9.00         13.15
## 99%     55.03     281.0        65        421.03        156.03
##     infected_rep3
## 1%           0.00
## 5%           0.00
## 95%         22.15
## 99%        292.03
## 
## $non_bait_windows
##     mock_rep1 mock_rep2 mock_rep3 infected_rep1 infected_rep2
## 1%          0         0         0             0             0
## 5%          0         0         0             0             0
## 95%         0         0         0             0             0
## 99%         0         0         0             0             0
##     infected_rep3
## 1%              0
## 5%              0
## 95%             0
## 99%             0
##  
##  Read depths:
##           sample   off_target on_target
## 1 infected_rep1 0.000000e+00 12.926777
## 2 infected_rep2 0.000000e+00  7.104087
## 3 infected_rep3 0.000000e+00 11.754151
## 4     mock_rep1 0.000000e+00  6.369732
## 5     mock_rep2 2.102402e-06 11.256492
## 6     mock_rep3 1.681922e-06  7.316518

The summary is very long but worthwhile. A feature of atacR is that it also keeps counts for non-bait windows, i.e. the windows outside the bait regions. These are not the same size as the bait windows - a single non-bait window covers all the space between the last window of one bait region and the first window of the next.

  • The treatments line gives the two classes of data that atacR understands you have, here mock and infected.
  • The samples line gives the sample and replicate information.
  • The Bait regions used line gives the bait region count.
  • The Total Windows line tells how many windows those baits are divided into.
  • The On/Off target read counts section tells how many reads fall inside the windows (on_target) and how many fall outside (off_target) for each sample.
  • The Quantiles section shows the read count at each quantile for each sample, in the bait-region and non-bait-region windows.
  • The Read depths section shows the average read depths in the on_target and off_target regions.

As we can see, the coverage in this small dataset is relatively low - an artefact of the small files used to keep the tutorial running quickly. Still, the on-target read depth averages roughly 10 per sample and the off-target reads are very low (< 1 %).
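
If you would rather have the on/off-target split as plain numbers, it can be recomputed from the two window sets with base R. This is only a sketch, and it assumes counts$non_bait_windows is stored as a SummarizedExperiment in the same way as counts$bait_windows.

# Recompute percent on-target per sample from the raw count matrices
on_target  <- colSums(SummarizedExperiment::assay(counts$bait_windows))
off_target <- colSums(SummarizedExperiment::assay(counts$non_bait_windows))
round(100 * on_target / (on_target + off_target), 2)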

Summary and QC plots

The atacR package has a range of summary and QC plots for visualising different aspects of the data.

The samples can be inspected through plots. The standard plot function creates a few summary style plots enabling you to view coverage distribution and region density. As it summarises windows, the more windows you have, the slower it runs!

plot(counts)

A coverage threshold plot can reveal the number of windows that have coverage lower than a given value.

windows_below_coverage_threshold_plot(counts)

Here we can see that mock_rep1 and mock_rep3 have fewer windows below the coverage threshold, so are generally better covered.

We can see from all these plots that, although the read mapping and filtering is specific to the bait regions, quite a lot of windows (~2000 in each sample) have counts of 0, which indicates that some of the DNA in the captured regions was not sampled. You may wish to play with the window size settings to see how robust this observation is to window size. Increasing the window size will likely reduce the number of zero-count windows by merging counts from adjacent windows; decreasing the window size will likely increase it. The level of granularity you use will be study dependent, and if you intend to conclude absence of counts (e.g. for detection of closed chromatin) then you’ll want to be very careful to compare against specific control windows.
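
To put a number on those zero-count windows, you can tally them per sample directly from the count matrix; this is plain base R over the SummarizedExperiment rather than a dedicated atacR helper.

# Number of bait windows with zero reads in each sample
colSums(SummarizedExperiment::assay(counts$bait_windows) == 0)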

Specific window counts

You can examine specific window counts quite easily. The internal object holding the data is of class SummarizedExperiment, which is part of BioConductor, so you can use functions from standard BioConductor packages to interrogate it. Here’s how you might get information on specific window counts.

First you must create a region of interest. Use the [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) package to do this.

roi <- GenomicRanges::GRanges(seqnames = "Chr1",
                              ranges = IRanges::IRanges(start = 245951, end = 246250))

Next, subset the window set of interest with [IRanges](https://bioconductor.org/packages/release/bioc/html/IRanges.html)

small_section <- IRanges::subsetByOverlaps(counts$bait_windows, roi)

The resulting object is a SummarizedExperiment, which doesn’t print its full contents, as it can be quite big. To get at the actual count matrix, use the assay() function.

SummarizedExperiment::assay(small_section)
##                    mock_rep1 mock_rep2 mock_rep3 infected_rep1
## Chr1:245951-246000         0         0         0             0
## Chr1:246001-246050         5         7        18             6
## Chr1:246051-246100         3         0        11             0
## Chr1:246101-246150         0         0         0             0
## Chr1:246151-246200        19         3        21             0
## Chr1:246201-246250         0         0         0             0
##                    infected_rep2 infected_rep3
## Chr1:245951-246000             0             0
## Chr1:246001-246050             4             1
## Chr1:246051-246100             9             7
## Chr1:246101-246150             0             0
## Chr1:246151-246200             6             9
## Chr1:246201-246250             0             0

Sample reproducibility

A PCA plot can be used to examine the similarity between the different samples. Here we can see that two of the infected replicates sit far from each other and from the remaining, more similar samples. You can use these plots to identify any samples that are extremely different from the others. As all of the infected samples differ from one another in different ways, we may simply be seeing a large amount of experimental variability in our results, which is important information in itself. So we’ll proceed with the data, keeping in mind that variability may be large and that, for particular treatments, we may need to gather more replicates.

sample_pca_plot(counts)
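
A complementary, package-agnostic reproducibility check is a sample-to-sample correlation matrix over the bait-window counts; Spearman correlation is reasonably robust to the zero-heavy, skewed distributions seen here. This is a base-R sketch, not an atacR function.

# Pairwise Spearman correlation between samples across all bait windows
round(cor(SummarizedExperiment::assay(counts$bait_windows), method = "spearman"), 2)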

An MA plot can show you eccentricities in each sample (see the [Wikipedia page](https://en.wikipedia.org/wiki/MA_plot) for more information). In the atacR MA plot a common reference is used: the median value for each window serves as a common denominator for every sample.

ma_plot(counts)

In this MA plot we see some structure in the data: the strong lines in each subplot come from windows with a count of zero. The infected samples overall show higher counts than the mock. The usual assumption that most windows do not change between samples may not hold here, as the clouds of points seem quite shifted between mock and infected.
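
To make M and A concrete, here is how they could be computed by hand for one sample against the per-window median reference; a small pseudocount avoids taking logs of zero. This mirrors the idea behind ma_plot() but is only a sketch, not its implementation.

mat <- SummarizedExperiment::assay(counts$bait_windows)
ref <- apply(mat, 1, median)                                   # per-window median reference
m <- log2((mat[, "infected_rep1"] + 0.5) / (ref + 0.5))        # M: log2 ratio vs reference
a <- 0.5 * log2((mat[, "infected_rep1"] + 0.5) * (ref + 0.5))  # A: average log2 intensity
plot(a, m, pch = ".", xlab = "A", ylab = "M")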

Normalisation

The normalisation step helps us to reduce systematic between-sample variability. Sequence count data are hard to normalise and cannot be normalised well by simple scaling. For RNA-seq data there are numerous methods, such as FPKM, that normalise to a degree. The best approach with ATAC-cap-seq data is to find the least varying windows and use those to scale the rest of the data.

atacR provides three types of normalisation. These are

  1. Library size
  2. Scale factor
  3. Goodness of Fit

The best of these is 3, Goodness of Fit. It is fast, automatically finds the least varying features in the data to normalise on, and does a reasonable job of between-sample normalisation. It is usually the one to choose.

The Library size normalisation is the most basic and the one that many studies treat as sufficient - each sample’s counts are scaled to account for that sample’s total library size. For most ATAC purposes this will be underpowered, because of the skew in the count distribution.

The Scale factor normalisation is provided to allow interaction with normalisation methods from other packages. With this you provide a number for each sample, and the counts in each sample are divided by the respective number.
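
As a concrete, hypothetical example of the scale factor route, you could compute library-size-style factors yourself and pass them in. Only scale_factor_normalise() and its scaling_factors argument come from this tutorial; the factor calculation below is just a sketch.

# Hypothetical scale factors: each sample's total counts relative to the mean total
lib_sizes <- colSums(SummarizedExperiment::assay(counts$bait_windows))
my_factors <- lib_sizes / mean(lib_sizes)
ls_normalised_counts <- scale_factor_normalise(counts, scaling_factors = my_factors)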

Check out the normalisations vignette for further information.

Goodness of Fit normalisation

Here we’ll run Goodness of Fit (GoF) normalisation on the sample data. The first step is to run the GoF code to find the most stable windows across the samples, which we will use to normalise.

auto_controls <- find_controls_by_GoF(counts)

We can check that the selected control windows have a lower GoF than the non-selected windows using the plot_GoF() function.

plot_GoF(counts, controls = auto_controls)

They are better: they have a lower, spikier mean Goodness of Fit, while the non-control data has a long-tailed distribution, so the difference is quite pronounced. We can now generate the normalisation factors and apply them. We’ll save the resulting information to a new slot in the counts object, then plot the pre- and post-normalisation data to see the effect of the normalisation.

gof_norm_factors <- get_GoF_factors(counts)

gof_normalised_counts <- scale_factor_normalise(counts,
                                                scaling_factors = gof_norm_factors)

counts$normalised_counts <- gof_normalised_counts


plot_counts(counts)

plot_counts(counts, which = "normalised_counts")

ma_plot(counts)

ma_plot(counts, which = "normalised_counts")

We can see that the distributions get a little closer to each other and that the spread of the data in the MA plots is reduced a little. The variability in these data is quite high though. See the normalisations vignette for further discussion.

Differential window counts

Once you are happy with the normalisation, you can try to estimate which windows have differential counts. atacR gives you three methods.

  1. edgeR exact test - this is a wrapper around the edgeR method for single factor designs, using the estimateDispersion method. This method was designed for genome wide studies, so it works best when only a few (~5 %) of the windows are expected to have differential counts. It is the most sensitive method in that situation.
  2. bootstrap t test - this is a brute force method that resamples each window’s sample counts and recalculates Student’s t statistic to build a background distribution of t. If the observed t lies in the tails of this distribution, differential counts are called. This method is useful when any number of the windows may show differential counts (a toy sketch of the resampling idea follows this list).
  3. Bayes Factor test - this calculates the Bayes factor for each window. The ratio of the Bayes factors for control and test is returned on a window-by-window basis; if the ratio exceeds a given threshold (4 by default) a differential count is called. This method is useful when any number of the windows may show differential counts.
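
To illustrate the resampling idea behind the bootstrap t test (option 2 above), here is a standalone toy sketch on made-up counts for a single window. It is not the estimate_fdr() implementation, just the logic in miniature.

set.seed(42)
obs_a <- c(7.3, 5.0, 1.3)    # made-up counts for one window, treatment a
obs_b <- c(3.4, 8.7, 11.2)   # made-up counts for one window, treatment b
tstat <- function(a, b) (mean(a) - mean(b)) / sqrt(var(a) / length(a) + var(b) / length(b))
obs_t <- tstat(obs_a, obs_b)
# Build a null distribution of t by resampling the pooled counts with replacement
pooled <- c(obs_a, obs_b)
null_t <- replicate(1000, {
  r <- sample(pooled, length(pooled), replace = TRUE)
  tstat(r[1:3], r[4:6])
})
# Proportion of resampled t values at least as extreme as the observed t
mean(abs(null_t) >= abs(obs_t), na.rm = TRUE)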

See the differential windows vignette for further discussion.

We can perform differential analysis in the following ways. We’ll use the which argument (see the which vignette) to make sure we analyse the normalised counts.

edgeRexact_result <- edgeR_exact(counts,
                                 which = "normalised_counts",
                                 treatment_a = "infected",
                                 treatment_b = "mock",
                                 remove_zeros = TRUE)
## Design matrix not provided. Switch to the classic mode.
bootstrap_result <- estimate_fdr(counts,
                                 which = "normalised_counts",
                                 treatment_a = "infected",
                                 treatment_b = "mock",
                                 iterations = 10)

bayesfactor_result <- estimate_bayes_factor(counts,
                                            treatment_a = "infected",
                                            treatment_b = "mock",
                                            which = "normalised_counts")

The resulting dataframes hold the results of these calculations.

head(bootstrap_result)
##   infected_rep1 infected_rep2 infected_rep3 mock_rep1 mock_rep2 mock_rep3
## 1      0.000000      0.000000      0.000000  0.000000  0.000000  0.000000
## 2      7.318935      5.020414      1.251322  3.376434  8.717982 11.171746
## 3      0.000000     11.295931      8.759255  2.025860  0.000000  6.827178
## 4      0.000000      0.000000      0.000000  0.000000  0.000000  0.000000
## 5      0.000000      7.530620     11.261899 12.830448  3.736278 13.033704
## 6      0.000000      0.000000      0.000000  0.000000  0.000000  0.000000
##               window          t fdr is_sig mean_infected mean_mock
## 1 Chr1:245951-246000         NA  NA     NA      0.000000  0.000000
## 2 Chr1:246001-246050 -1.1112259 0.4  FALSE      4.530224  7.755387
## 3 Chr1:246051-246100  0.9391965 0.3  FALSE      6.685062  2.951013
## 4 Chr1:246101-246150         NA  NA     NA      0.000000  0.000000
## 5 Chr1:246151-246200 -0.7982347 0.1  FALSE      6.264173  9.866810
## 6 Chr1:246201-246250         NA  NA     NA      0.000000  0.000000
##   sd_infected  sd_mock log2_fold_change
## 1    0.000000 0.000000              NaN
## 2    3.063364 3.985808       -0.7756166
## 3    5.926738 3.506354        1.1797307
## 4    0.000000 0.000000              NaN
## 5    5.736768 5.310169       -0.6554596
## 6    0.000000 0.000000              NaN
head(bayesfactor_result)
##   infected_rep1 infected_rep2 infected_rep3 mock_rep1 mock_rep2 mock_rep3
## 1      0.000000      0.000000      0.000000  0.000000  0.000000  0.000000
## 2      7.318935      5.020414      1.251322  3.376434  8.717982 11.171746
## 3      0.000000     11.295931      8.759255  2.025860  0.000000  6.827178
## 4      0.000000      0.000000      0.000000  0.000000  0.000000  0.000000
## 5      0.000000      7.530620     11.261899 12.830448  3.736278 13.033704
## 6      0.000000      0.000000      0.000000  0.000000  0.000000  0.000000
##               window bayes_factor is_sig mean_infected mean_mock
## 1 Chr1:245951-246000           NA     NA      0.000000  0.000000
## 2 Chr1:246001-246050   -0.2564561  FALSE      4.530224  7.755387
## 3 Chr1:246051-246100   -0.3387662  FALSE      6.685062  2.951013
## 4 Chr1:246101-246150           NA     NA      0.000000  0.000000
## 5 Chr1:246151-246200   -0.3996749  FALSE      6.264173  9.866810
## 6 Chr1:246201-246250           NA     NA      0.000000  0.000000
##   sd_infected  sd_mock log2_fold_change
## 1    0.000000 0.000000              NaN
## 2    3.063364 3.985808       -0.7756166
## 3    5.926738 3.506354        1.1797307
## 4    0.000000 0.000000              NaN
## 5    5.736768 5.310169       -0.6554596
## 6    0.000000 0.000000              NaN
head(edgeRexact_result)
##   infected_rep1 infected_rep2 infected_rep3 mock_rep1 mock_rep2 mock_rep3
## 1      0.000000      0.000000      0.000000  0.000000  0.000000  0.000000
## 2      7.318935      5.020414      1.251322  3.376434  8.717982 11.171746
## 3      0.000000     11.295931      8.759255  2.025860  0.000000  6.827178
## 4      0.000000      0.000000      0.000000  0.000000  0.000000  0.000000
## 5      0.000000      7.530620     11.261899 12.830448  3.736278 13.033704
## 6      0.000000      0.000000      0.000000  0.000000  0.000000  0.000000
##               window   p_value is_sig mean_infected mean_mock sd_infected
## 1 Chr1:245951-246000        NA     NA      0.000000  0.000000    0.000000
## 2 Chr1:246001-246050 0.1669629  FALSE      4.530224  7.755387    3.063364
## 3 Chr1:246051-246100 1.0000000  FALSE      6.685062  2.951013    5.926738
## 4 Chr1:246101-246150        NA     NA      0.000000  0.000000    0.000000
## 5 Chr1:246151-246200 0.1723480  FALSE      6.264173  9.866810    5.736768
## 6 Chr1:246201-246250        NA     NA      0.000000  0.000000    0.000000
##    sd_mock log2_fold_change
## 1 0.000000              NaN
## 2 3.985808       -0.7756166
## 3 3.506354        1.1797307
## 4 0.000000              NaN
## 5 5.310169       -0.6554596
## 6 0.000000              NaN
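
Because each result is an ordinary dataframe with window, is_sig and log2_fold_change columns, pulling out the windows called as differential needs only base R subsetting; in this small tutorial dataset the filtered frames may well be empty.

# Windows called significant by each method; subset() treats the NA rows
# (the all-zero windows) as FALSE and drops them
subset(bootstrap_result, is_sig)
subset(bayesfactor_result, is_sig)
subset(edgeRexact_result, is_sig)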

Each of these methods works on single factor designs; there is also a multiclass variant of each that works on common-control designs. See the differential windows vignette for further discussion of these.

bf_multi <- estimate_bayes_factor_multiclass(counts, "mock",
                                             factor = 0.5,
                                             which = "normalised_counts")

head(bf_multi)
##               window bayes_factor is_sig   mean_a   mean_b     sd_a
## 1 Chr1:245951-246000           NA     NA 0.000000 0.000000 0.000000
## 2 Chr1:246001-246050   -0.2564561  FALSE 4.530224 7.755387 3.063364
## 3 Chr1:246051-246100   -0.3387662  FALSE 6.685062 2.951013 5.926738
## 4 Chr1:246101-246150           NA     NA 0.000000 0.000000 0.000000
## 5 Chr1:246151-246200   -0.3996749  FALSE 6.264173 9.866810 5.736768
## 6 Chr1:246201-246250           NA     NA 0.000000 0.000000 0.000000
##       sd_b log2_fold_change        a    b
## 1 0.000000              NaN infected mock
## 2 3.985808       -0.7756166 infected mock
## 3 3.506354        1.1797307 infected mock
## 4 0.000000              NaN infected mock
## 5 5.310169       -0.6554596 infected mock
## 6 0.000000              NaN infected mock
fdr_multi <- estimate_fdr_multiclass(counts, "mock",
                                     fdr_level = 0.05,
                                     which = "normalised_counts")

head(fdr_multi)
##               window fdr is_sig   mean_a   mean_b     sd_a     sd_b
## 1 Chr1:245951-246000  NA     NA 0.000000 0.000000 0.000000 0.000000
## 2 Chr1:246001-246050 0.2  FALSE 4.530224 7.755387 3.063364 3.985808
## 3 Chr1:246051-246100 0.2  FALSE 6.685062 2.951013 5.926738 3.506354
## 4 Chr1:246101-246150  NA     NA 0.000000 0.000000 0.000000 0.000000
## 5 Chr1:246151-246200 0.5  FALSE 6.264173 9.866810 5.736768 5.310169
## 6 Chr1:246201-246250  NA     NA 0.000000 0.000000 0.000000 0.000000
##   log2_fold_change        a    b
## 1              NaN infected mock
## 2       -0.7756166 infected mock
## 3        1.1797307 infected mock
## 4              NaN infected mock
## 5       -0.6554596 infected mock
## 6              NaN infected mock

The edgeR_multiclass() function does not return a dataframe; instead it returns the native DGELRT objects (see the DGELRT manual for more information) from each comparison, in a list() named by the treatment used.

edgeR_multiclass(counts, "mock",
                 remove_zeros = TRUE,
                 which = "normalised_counts")
## $infected
## An object of class "DGELRT"
## $coefficients
##                    (Intercept)  treatment2
## Chr1:246001-246050   -9.433027  1.21940472
## Chr1:246051-246100   -8.946885 -0.00713409
## Chr1:246151-246200   -9.069668  1.31352305
## Chr1:246251-246300   -9.838327  2.00119141
## Chr1:246401-246450   -8.881706  0.90716777
## 2143 more rows ...
## 
## $fitted.values
##                    mock_rep1 mock_rep2 mock_rep3 infected_rep1
## Chr1:246001-246050  5.422236 17.672178  5.724296      5.736514
## Chr1:246051-246100  2.554802  8.326624  2.697124      9.445346
## Chr1:246151-246200  8.598314 28.023665  9.077305      8.334501
## Chr1:246251-246300  7.925342 25.830313  8.366844      3.759020
## Chr1:246401-246450  6.901091 22.492069  7.285534     10.099912
##                    infected_rep2 infected_rep3
## Chr1:246001-246050      3.243761      5.350834
## Chr1:246051-246100      5.340953      8.810313
## Chr1:246151-246200      4.712816      7.774153
## Chr1:246251-246300      2.125571      3.506293
## Chr1:246401-246450      5.711083      9.420871
## 2143 more rows ...
## 
## $deviance
## Chr1:246001-246050 Chr1:246051-246100 Chr1:246151-246200 
##           2.792825          14.106727           6.936371 
## Chr1:246251-246300 Chr1:246401-246450 
##           1.951665           3.003295 
## 2143 more elements ...
## 
## $method
## [1] "oneway"
## 
## $unshrunk.coefficients
##                    (Intercept)   treatment2
## Chr1:246001-246050   -9.466047  1.242723488
## Chr1:246051-246100   -8.967376 -0.008480493
## Chr1:246151-246200   -9.092495  1.330229294
## Chr1:246251-246300   -9.888740  2.044973953
## Chr1:246401-246450   -8.900372  0.918219636
## 2143 more rows ...
## 
## $df.residual
## [1] 4 4 4 4 4
## 2143 more elements ...
## 
## $design
##   (Intercept) treatment2
## 1           1          1
## 2           1          1
## 3           1          1
## 4           1          0
## 5           1          0
## 6           1          0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$treatment
## [1] "contr.treatment"
## 
## 
## $offset
##       [,1]     [,2]     [,3]    [,4]     [,5]    [,6]
## x 9.913832 11.09531 9.968043 11.2129 10.64278 11.1433
## attr(,"class")
## [1] "compressedMatrix"
## attr(,"repeat.row")
## [1] TRUE
## attr(,"repeat.col")
## [1] FALSE
## 
## $dispersion
## [1] 0.7808653 0.6432634 1.0616874 0.9388462 0.9909758
## 2143 more elements ...
## 
## $prior.count
## [1] 0.125
## 
## $AveLogCPM
## [1] 7.729668 7.437319 8.261176 8.014694 8.118188
## 2143 more elements ...
## 
## $df.residual.zeros
## [1] 4 4 4 4 4
## 2143 more elements ...
## 
## $df.prior
## [1] Inf
## 
## $var.post
## Chr1:246001-246050 Chr1:246051-246100 Chr1:246151-246200 
##           1.191134           1.191786           1.258854 
## Chr1:246251-246300 Chr1:246401-246450 
##           1.217722           1.233231 
## 2143 more elements ...
## 
## $var.prior
## Chr1:246001-246050 Chr1:246051-246100 Chr1:246151-246200 
##           1.191134           1.191786           1.258854 
## Chr1:246251-246300 Chr1:246401-246450 
##           1.217722           1.233231 
## 2143 more elements ...
## 
## $samples
##               group lib.size norm.factors
## mock_rep1         1 20207.96            1
## mock_rep2         1 65861.86            1
## mock_rep3         1 21333.69            1
## infected_rep1     1 74079.82            1
## infected_rep2     1 41889.08            1
## infected_rep3     1 69099.26            1
## 
## $table
##                          logFC   logCPM            F     PValue
## Chr1:246001-246050  1.75922914 7.729668 1.9360298799 0.16413628
## Chr1:246051-246100 -0.01029232 7.437319 0.0001004109 0.99200514
## Chr1:246151-246200  1.89501319 8.261176 1.6737521988 0.19579149
## Chr1:246251-246300  2.88710892 8.014694 4.0529890265 0.04412434
## Chr1:246401-246450  1.30876644 8.118188 0.8950651751 0.34413498
## 2143 more rows ...
## 
## $comparison
## [1] "treatment2"
## 
## $df.test
## [1] 1 1 1 1 1
## 2143 more elements ...
## 
## $df.total
## [1] 8592 8592 8592 8592 8592
## 2143 more elements ...
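
Since the list members are ordinary edgeR DGELRT objects, edgeR’s own helpers apply to them; for example, topTags() will rank the windows by evidence for each comparison. A short sketch (the dgelrt_list name is just illustrative):

dgelrt_list <- edgeR_multiclass(counts, "mock",
                                remove_zeros = TRUE,
                                which = "normalised_counts")
# Rank windows for the infected-vs-mock comparison with edgeR's topTags()
edgeR::topTags(dgelrt_list$infected, n = 10)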