Some cancers are associated with the presence of viruses that disrupt chromosomal DNA. Among the best known examples are cervical cancers and head-and-neck cancers that are associated with human papilloma virus (HPV). Using whole genome short-read sequencing, it is possible to identify breakpoints, including those where viral DNA is merged with human DNA. Each breakpoint defines a local instance of “structural variation” (SV) where segments of chromosomes are joined together in non-normal ways. It remains an open question whether those fragments are fully integrated into a human chromosome or just exist as circular extrachromosomal DNA. Using newer long-read sequencing technology, one can obtain information about the order of multiple breakpoints in the original DNA molecules in the tumor sample. In this vignette, we describe SVAlignR, an R package intended to use these long-read sequences to reconstruct the original molecules.
We begin by loading the SVAlignR
package, and a sample data set.
## qname qlen connection
## LR1 870fe397-c2e5-4fbc-8324-506484f4c89a 64349 50-74-0-0-50-74-0-50-74-0-50
## LR2 cc6314e0-3582-4f15-8d59-44c09bfef1e3 56447 74-0-63-74-0-50-74-0-50-74-0
## LR3 52f855b6-de11-49d8-b298-6460d5274a2f 56306 50-74-0-50-74-0-50-74-0-50-74
## LR4 55becc2c-c24c-49df-b73a-6459aabd2a05 55510 74-50-74-0-50-74-0-50-74-0-50
## LR5 591c04a7-e1f5-40ca-b707-82ce2cfd0ff2 54409 74-50-74-0-50-74-0-50-74-0-50
## LR6 b89fe96f-d773-494a-a702-7d7879219581 51084 0-50-74-0-50-74-0-50-74-0
Each of the 197 rows in this spreadsheet represents a different Oxford Nanopore long read. The qname
column is a unique identifier; the qlen
column contains the length of the read in bytes; and the connections
column lists the breakpoints that the read contains, each of which has been assigned a numeric identifier. We are interested in this column. We have two main goals:
As with many sequence alignment problems, we start by removing any duplicate “reads”; here that means duplication of the break-point sequence summaries of the actual reads.
LR0 <- longreads$connection
names(LR0) <- rownames(longreads)
LR <- LR0[!duplicated(LR0)]
weights <- sapply(LR, function(lr) sum(LR0 == lr))
length(LR) # expect 158
## [1] 158
We want to cluster the break-point sequences.
In order to make use of existing algorithms, we need to convert the numeric identifiers into single characters as part of an “alphabet”. The Cipher
class takes in a character vector containing the “words” (that is, our break-point sequences) that need to be rewritten. By default, it assumes (as in our example) that “letters” are separated by hyphens, identifies the unique symbols, and creates a tool to convert back-and-forth from one “alphabet” to another.
## An object of class "Cipher"
## Slot "forward":
## 0 12 13 14 15 17 19 23 24 25 26 27 28 31 32 33 34 35 36 37 38 40
## "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "P" "Q" "R" "S" "T" "V" "W" "X"
## 45 46 48 50 51 52 54 56 57 58 6 60 61 63 65 66 70 71 74 75 x
## "Y" "Z" "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
##
## Slot "reverse":
## A B C D E F G H I J K L M N P Q R
## "0" "12" "13" "14" "15" "17" "19" "23" "24" "25" "26" "27" "28" "31" "32" "33" "34"
## S T V W X Y Z a b c d e f g h i j
## "35" "36" "37" "38" "40" "45" "46" "48" "50" "51" "52" "54" "56" "57" "58" "6" "60"
## k l m n o p q r s - ?
## "61" "63" "65" "66" "70" "71" "74" "75" "x" ":" "?"
Now we are able to convert the original sequences of numbers into words in our specialized alphabet.
## LR1 LR2
## "50-74-0-0-50-74-0-50-74-0-50" "74-0-63-74-0-50-74-0-50-74-0"
## LR3 LR4
## "50-74-0-50-74-0-50-74-0-50-74" "74-50-74-0-50-74-0-50-74-0-50"
## LR6 LR7
## "0-50-74-0-50-74-0-50-74-0" "74-0-0-0-50-74-0"
## LR1 LR2 LR3 LR4 LR6 LR7
## "bqAAbqAbqAb" "qAlqAbqAbqA" "bqAbqAbqAbq" "qbqAbqAbqAb" "AbqAbqAbqA" "qAAAbqA"
The recoding step is already included in the SequenceCluster
constructor and will happen automatically. So, to cluster the long-read break point sequences, we just need to call the constructor on the (named and de-duplicated) sequences. The core algorithm is the classic Needelman-Wunsch global alignment algorithm, as implemented in the NameNeedle
R package. As used in the SVAlignR
package, it computes an alignment score between any two sequences based on a match score of 2, a mismatch penalty of -6, and a gap penalty of -2. We fill a square matrix with scores for every pair (I, J) of sequences. The resulting scores can be either positive or negative. For each row X, we rescale into the interval [0,1] by computing \[Y_{IJ} = 1 - \frac{X_{IJ} - min(X_{I.})}{max(X_{I.})-min(X_{I.})}.\] Since the scores need not be symmetric, we force symmetry by averaging the scaled score matrix and its transpose.
We then perform hierarchical clustering using the symmetrized, rescaled score matrix to define distances. We cut the dendrogam to cluster the sequences into 20 groups. The number 20 is arbitrary. In our data set, it should yield about 8 long-read break-point sequences per cluster. For later steps, we also want to make sure that each cluster contains at leeast 2 sequences.
## Warning in SequenceCluster(LR0, NC = 20): Removing 39 duplicated sequences.
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 9 2 17 5 14 6 7 10 7 8 4 5 9 10 7 5 18 7 4 4
Now we see why 20 clusters is appropriate. Each cluster is a relatively small size, but they all contain more than one sequence so that aligning those sequences will be meaningful. Figure 1 displays the resulting dendrogram, with each of the 20 clusters shown in a different color.
Figure 2 contains a simplified view of the same dendrogram, cutting off the clutter below the major splits into clusters.
Figure 3 displays an alternate view of the dendrogram, emphasizing the relationships between clusters.
Both Figure 2 and, more strongly, Figure 3 indicate that there are approximately four major clusters.
We can also show the dendrogram attached to a heatmap containing the distance matrix.
Each of the clusters uses a small enough set of symbols that we can safely rewrite them as though they are sequences of amino acids. We then use the align
function from the SVAlignR
package to perform multiple sequence alignments, using the “ClustalW” algorithm on each cluster. Because the components are not amino acids, however, we do not need to use one of the PAM or BLOSUM matrices. Instead, we define a substitution matrix that gives the same match and mismatch scores to each pair of symbols.
MYSUB <- makeSubsMatrix(match = 2, mismatch = -6)
consensus <- alignAllClusters(seqclust, mysub = MYSUB, gapO = 2, gapE = 0.1)
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## [1] "list"
## [1] 20
Figure 5 contains an example of a multiple sequence alignment.
Figure 6 shows all 20 alignments.
opar <- par(mfrow = c(4,3))
for (K in 1:length(consensus)) {
image(consensus[[K]], col = SVAlignR:::myColorSet[K])
}
##
## ---------------------
## Welcome to dendextend version 1.15.2
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags:
## https://stackoverflow.com/questions/tagged/dendextend
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
myColorSet <- SVAlignR:::myColorSet
NC <- seqclust@NC
mycut <- mean(rev(seqclust@hc$height)[NC + -1:0])
hcd <- as.dendrogram(seqclust@hc)
dend2 <- cut(hcd, h = mycut)
DU <- dend2$upper
blocks <- sapply(consensus, function(X) {
X@consensus
})
labels(DU) <- paste(" [", 1:NC, "] ", blocks, " ", sep = "")
opar <- par(bg = "gray90")
plot(ape::as.phylo(DU), type = "unrooted", rotate.tree = -20,
edge.color = "black", edge.width = 2,
lab4ut = "axial",
tip.color = myColorSet[1:NC], cex = 1.2)
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dendextend_1.15.2 SVAlignR_0.2.2
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.2 xfun_0.31 bslib_0.3.1
## [4] purrr_0.3.4 lattice_0.20-45 colorspace_2.0-3
## [7] vctrs_0.4.1 generics_0.1.2 viridisLite_0.4.0
## [10] htmltools_0.5.2 stats4_4.2.1 yaml_2.3.5
## [13] utf8_1.2.2 rlang_1.0.3 jquerylib_0.1.4
## [16] pillar_1.7.0 glue_1.6.2 BiocGenerics_0.42.0
## [19] GenomeInfoDbData_1.2.8 lifecycle_1.0.1 stringr_1.4.0
## [22] zlibbioc_1.42.0 Biostrings_2.64.0 munsell_0.5.0
## [25] gtable_0.3.0 evaluate_0.15 knitr_1.39
## [28] IRanges_2.30.0 fastmap_1.1.0 GenomeInfoDb_1.32.2
## [31] parallel_4.2.1 fansi_1.0.3 highr_0.9
## [34] msa_1.28.0 Rcpp_1.0.8.3 scales_1.2.0
## [37] S4Vectors_0.34.0 scatterplot3d_0.3-41 jsonlite_1.8.0
## [40] XVector_0.36.0 gridExtra_2.3 ggplot2_3.3.6
## [43] digest_0.6.29 stringi_1.7.6 dplyr_1.0.9
## [46] oompaBase_3.2.9 grid_4.2.1 cli_3.3.0
## [49] tools_4.2.1 bitops_1.0-7 NameNeedle_1.2.6
## [52] magrittr_2.0.3 sass_0.4.1 Polychrome_1.5.1
## [55] RCurl_1.98-1.7 tibble_3.1.7 cluster_2.1.3
## [58] ape_5.6-2 crayon_1.5.1 pkgconfig_2.0.3
## [61] ellipsis_0.3.2 rmarkdown_2.14 viridis_0.6.2
## [64] R6_2.5.1 igraph_1.3.2 nlme_3.1-157
## [67] compiler_4.2.1