Introduction

Some cancers are associated with the presence of viruses that disrupt chromosomal DNA. Among the best known examples are cervical cancers and head-and-neck cancers that are associated with human papilloma virus (HPV). Using whole genome short-read sequencing, it is possible to identify breakpoints, including those where viral DNA is merged with human DNA. Each breakpoint defines a local instance of “structural variation” (SV) where segments of chromosomes are joined together in non-normal ways. It remains an open question whether those fragments are fully integrated into a human chromosome or just exist as circular extrachromosomal DNA. Using newer long-read sequencing technology, one can obtain information about the order of multiple breakpoints in the original DNA molecules in the tumor sample. In this vignette, we describe SVAlignR, an R package intended to use these long-read sequences to reconstruct the original molecules.

Getting Started

We begin by loading the SVAlignR package, and a sample data set.

library(SVAlignR)
data(longreads)
head(longreads[, 1:3])
##                                    qname  qlen                    connection
## LR1 870fe397-c2e5-4fbc-8324-506484f4c89a 64349  50-74-0-0-50-74-0-50-74-0-50
## LR2 cc6314e0-3582-4f15-8d59-44c09bfef1e3 56447  74-0-63-74-0-50-74-0-50-74-0
## LR3 52f855b6-de11-49d8-b298-6460d5274a2f 56306 50-74-0-50-74-0-50-74-0-50-74
## LR4 55becc2c-c24c-49df-b73a-6459aabd2a05 55510 74-50-74-0-50-74-0-50-74-0-50
## LR5 591c04a7-e1f5-40ca-b707-82ce2cfd0ff2 54409 74-50-74-0-50-74-0-50-74-0-50
## LR6 b89fe96f-d773-494a-a702-7d7879219581 51084     0-50-74-0-50-74-0-50-74-0

Each of the 197 rows in this spreadsheet represents a different Oxford Nanopore long read. The qname column is a unique identifier; the qlen column contains the length of the read in bytes; and the connections column lists the breakpoints that the read contains, each of which has been assigned a numeric identifier. We are interested in this column. We have two main goals:

  1. Assemble the long-reads (which should be thought of as fragmentary) into complete molecules.
  2. Start constructing a phylogenetic tree showing the evolution of these HPV-human combining molecules.

As with many sequence alignment problems, we start by removing any duplicate “reads”; here that means duplication of the break-point sequence summaries of the actual reads.

LR0 <- longreads$connection
names(LR0) <- rownames(longreads)
LR <- LR0[!duplicated(LR0)]
weights <- sapply(LR, function(lr) sum(LR0 == lr))
length(LR) # expect 158
## [1] 158

Clustering Sequences

We want to cluster the break-point sequences.

Recoding

In order to make use of existing algorithms, we need to convert the numeric identifiers into single characters as part of an “alphabet”. The Cipher class takes in a character vector containing the “words” (that is, our break-point sequences) that need to be rewritten. By default, it assumes (as in our example) that “letters” are separated by hyphens, identifies the unique symbols, and creates a tool to convert back-and-forth from one “alphabet” to another.

alphabet <- Cipher(LR)
alphabet
## An object of class "Cipher"
## Slot "forward":
##   0  12  13  14  15  17  19  23  24  25  26  27  28  31  32  33  34  35  36  37  38  40 
## "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "P" "Q" "R" "S" "T" "V" "W" "X" 
##  45  46  48  50  51  52  54  56  57  58   6  60  61  63  65  66  70  71  74  75   x 
## "Y" "Z" "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" 
## 
## Slot "reverse":
##    A    B    C    D    E    F    G    H    I    J    K    L    M    N    P    Q    R 
##  "0" "12" "13" "14" "15" "17" "19" "23" "24" "25" "26" "27" "28" "31" "32" "33" "34" 
##    S    T    V    W    X    Y    Z    a    b    c    d    e    f    g    h    i    j 
## "35" "36" "37" "38" "40" "45" "46" "48" "50" "51" "52" "54" "56" "57" "58"  "6" "60" 
##    k    l    m    n    o    p    q    r    s    -    ? 
## "61" "63" "65" "66" "70" "71" "74" "75"  "x"  ":"  "?"

Now we are able to convert the original sequences of numbers into words in our specialized alphabet.

head(LR)
##                             LR1                             LR2 
##  "50-74-0-0-50-74-0-50-74-0-50"  "74-0-63-74-0-50-74-0-50-74-0" 
##                             LR3                             LR4 
## "50-74-0-50-74-0-50-74-0-50-74" "74-50-74-0-50-74-0-50-74-0-50" 
##                             LR6                             LR7 
##     "0-50-74-0-50-74-0-50-74-0"              "74-0-0-0-50-74-0"
pseudo <- encode(alphabet, LR)
head(pseudo)
##           LR1           LR2           LR3           LR4           LR6           LR7 
## "bqAAbqAbqAb" "qAlqAbqAbqA" "bqAbqAbqAbq" "qbqAbqAbqAb"  "AbqAbqAbqA"     "qAAAbqA"

Clusters

The recoding step is already included in the SequenceCluster constructor and will happen automatically. So, to cluster the long-read break point sequences, we just need to call the constructor on the (named and de-duplicated) sequences. The core algorithm is the classic Needelman-Wunsch global alignment algorithm, as implemented in the NameNeedle R package. As used in the SVAlignR package, it computes an alignment score between any two sequences based on a match score of 2, a mismatch penalty of -6, and a gap penalty of -2. We fill a square matrix with scores for every pair (I, J) of sequences. The resulting scores can be either positive or negative. For each row X, we rescale into the interval [0,1] by computing \[Y_{IJ} = 1 - \frac{X_{IJ} - min(X_{I.})}{max(X_{I.})-min(X_{I.})}.\] Since the scores need not be symmetric, we force symmetry by averaging the scaled score matrix and its transpose.

We then perform hierarchical clustering using the symmetrized, rescaled score matrix to define distances. We cut the dendrogam to cluster the sequences into 20 groups. The number 20 is arbitrary. In our data set, it should yield about 8 long-read break-point sequences per cluster. For later steps, we also want to make sure that each cluster contains at leeast 2 sequences.

seqclust <- SequenceCluster(LR0, NC = 20)
## Warning in SequenceCluster(LR0, NC = 20): Removing 39 duplicated sequences.
table(seqclust@clusters)
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  9  2 17  5 14  6  7 10  7  8  4  5  9 10  7  5 18  7  4  4

Now we see why 20 clusters is appropriate. Each cluster is a relatively small size, but they all contain more than one sequence so that aligning those sequences will be meaningful. Figure 1 displays the resulting dendrogram, with each of the 20 clusters shown in a different color.

plot(seqclust)
Figure 1: Hierarchical clustering of break point sequences.

Figure 1: Hierarchical clustering of break point sequences.

Figure 2 contains a simplified view of the same dendrogram, cutting off the clutter below the major splits into clusters.

plot(seqclust, type = "clipped")
Figure 2: Uncluttered clusters.

Figure 2: Uncluttered clusters.

Figure 3 displays an alternate view of the dendrogram, emphasizing the relationships between clusters.

plot(seqclust, type = "unrooted")
Figure 3: Unrooted phylogeny tree.

Figure 3: Unrooted phylogeny tree.

Both Figure 2 and, more strongly, Figure 3 indicate that there are approximately four major clusters.

We can also show the dendrogram attached to a heatmap containing the distance matrix.

heat(seqclust)
Figure 4: Heatmap of the distance matrix.

Figure 4: Heatmap of the distance matrix.

Aligning Clustered Sequences

Each of the clusters uses a small enough set of symbols that we can safely rewrite them as though they are sequences of amino acids. We then use the align function from the SVAlignR package to perform multiple sequence alignments, using the “ClustalW” algorithm on each cluster. Because the components are not amino acids, however, we do not need to use one of the PAM or BLOSUM matrices. Instead, we define a substitution matrix that gives the same match and mismatch scores to each pair of symbols.

MYSUB <- makeSubsMatrix(match = 2, mismatch = -6)
consensus <- alignAllClusters(seqclust, mysub = MYSUB, gapO = 2, gapE = 0.1)
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
class(consensus)
## [1] "list"
length(consensus)
## [1] 20

Figure 5 contains an example of a multiple sequence alignment.

opar <- par(mai = c(1.02, 0.82, 0.82, 0.82))
image(consensus[[15]])
Figure 5: Example alignment of cluster 15.

Figure 5: Example alignment of cluster 15.

par(opar)
rm(opar)

Alignments By Cluster

Figure 6 shows all 20 alignments.

opar <- par(mfrow = c(4,3))
for (K in 1:length(consensus)) {
  image(consensus[[K]], col = SVAlignR:::myColorSet[K])
}
Figure 6: Multiple sequence alignments of clusters.

Figure 6: Multiple sequence alignments of clusters.

par(opar)
Figure 6: Multiple sequence alignments of clusters.

Figure 6: Multiple sequence alignments of clusters.

Consensus Sequences

library(dendextend)
## 
## ---------------------
## Welcome to dendextend version 1.15.2
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags: 
##   https://stackoverflow.com/questions/tagged/dendextend
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
## 
##     cutree
myColorSet <- SVAlignR:::myColorSet
NC <- seqclust@NC
mycut <- mean(rev(seqclust@hc$height)[NC + -1:0])
hcd <- as.dendrogram(seqclust@hc)
dend2 <- cut(hcd, h = mycut)
DU <- dend2$upper
blocks <- sapply(consensus, function(X) {
  X@consensus
})
labels(DU) <- paste("   [", 1:NC, "]  ",  blocks, "   ", sep = "")

opar <- par(bg = "gray90")
plot(ape::as.phylo(DU), type = "unrooted", rotate.tree = -20,
     edge.color = "black", edge.width = 2,
     lab4ut = "axial", 
     tip.color = myColorSet[1:NC], cex = 1.2)
Figure 7: Unrooted phylogeny tree with consensus sequences.

Figure 7: Unrooted phylogeny tree with consensus sequences.

par(opar)

Appendix

sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=C                           LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dendextend_1.15.2 SVAlignR_0.2.2   
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.2       xfun_0.31              bslib_0.3.1           
##  [4] purrr_0.3.4            lattice_0.20-45        colorspace_2.0-3      
##  [7] vctrs_0.4.1            generics_0.1.2         viridisLite_0.4.0     
## [10] htmltools_0.5.2        stats4_4.2.1           yaml_2.3.5            
## [13] utf8_1.2.2             rlang_1.0.3            jquerylib_0.1.4       
## [16] pillar_1.7.0           glue_1.6.2             BiocGenerics_0.42.0   
## [19] GenomeInfoDbData_1.2.8 lifecycle_1.0.1        stringr_1.4.0         
## [22] zlibbioc_1.42.0        Biostrings_2.64.0      munsell_0.5.0         
## [25] gtable_0.3.0           evaluate_0.15          knitr_1.39            
## [28] IRanges_2.30.0         fastmap_1.1.0          GenomeInfoDb_1.32.2   
## [31] parallel_4.2.1         fansi_1.0.3            highr_0.9             
## [34] msa_1.28.0             Rcpp_1.0.8.3           scales_1.2.0          
## [37] S4Vectors_0.34.0       scatterplot3d_0.3-41   jsonlite_1.8.0        
## [40] XVector_0.36.0         gridExtra_2.3          ggplot2_3.3.6         
## [43] digest_0.6.29          stringi_1.7.6          dplyr_1.0.9           
## [46] oompaBase_3.2.9        grid_4.2.1             cli_3.3.0             
## [49] tools_4.2.1            bitops_1.0-7           NameNeedle_1.2.6      
## [52] magrittr_2.0.3         sass_0.4.1             Polychrome_1.5.1      
## [55] RCurl_1.98-1.7         tibble_3.1.7           cluster_2.1.3         
## [58] ape_5.6-2              crayon_1.5.1           pkgconfig_2.0.3       
## [61] ellipsis_0.3.2         rmarkdown_2.14         viridis_0.6.2         
## [64] R6_2.5.1               igraph_1.3.2           nlme_3.1-157          
## [67] compiler_4.2.1