The ALPACA algorithm
We introduce ALPACA, an algorithm to infer allele-specific copy-number profiles of every tumour clone, given three inputs that can be inferred by existing methods: estimates of allele-specific fractional copy number from multiple bulk tumour samples; a fixed number of clones and the proportion of each clone in each tumour sample; and a fixed topology of the tumour phylogenetic tree representing the phylogenetic relationships between all clones. We provide an overview of the model underlying ALPACA, the formulation of the problem that ALPACA aims to solve, and the ALPACA algorithms used to solve it. Additional details can be found in the Supplementary Information.
Copy-number evolutionary model
We assume that a tumour is composed of \(n\) distinct tumour clones with their genome partitioned into \(m\) copy-number segments owing to the effect of the SCNAs that have been accumulated through tumour evolution. As such, each tumour clone \(i\in \{1,\,…,{n}\}\) in each genomic segment \(s\in \{1,\,…,{m}\}\) has a number \({c}_{s,i}^{{\rm{A}}}\) of copies for allele A, and \({c}_{s,i}^{{\rm{B}}}\) for allele B. For a normal diploid clone \(i=0\), we know that \({c}_{s,0}^{{\rm{A}}}\,=\,{c}_{s,0}^{{\rm{B}}}\,=\,1\). Instead, for any tumour clone \(i\in \{1,\,…,{n}\}\), the copy numbers \({c}_{s,i}^{{\rm{A}}}\) and \({c}_{s,i}^{{\rm{B}}}\) are unknown, and ALPACA aims to infer them.
We model the evolution of the \(n\) tumour clones with a phylogenetic tree \(T=(V,{E})\), in which \(V\) represents the set of nodes and \(E\) represents the related set of edges, which also defines the topology of \(T\). Specifically, each clone \(i\) is represented by a vertex \({v}_{i}\) of the tree; that is, \({v}_{i}\in V\). If tumour clone \(i\) is a parent of clone \(j\), then \(({v}_{i},{v}_{j})\) is an edge of \(T\); that is, \(({v}_{i},{v}_{j})\in E\). We assume, without loss of generality, that the MRCA of the tumour clones is clone \(i=1\). We also assume that the normal clone \(i=0\) is the root vertex of the tree \({v}_{0}\in V\), such that \(({v}_{0},{v}_{1})\in E\).
As we model SCNA evolution of the \(n\) tumour clones, each node of the tree \(T\) is labelled by related copy numbers for all segments. Specifically, for each tumour clone \(i\in \{1,\,…,{n}\}\), the corresponding vertex \({v}_{i}\in V\) is labelled in each segment \(s\in \{1,\,…,{m}\}\) by the copy number \({c}_{s,i}^{{\rm{A}}}\in {\mathbb{N}}\) for allele A and \({c}_{s,i}^{{\rm{B}}}\in {\mathbb{N}}\) for allele B. The normal root clone at vertex \({v}_{0}\) is labelled by \({c}_{s,0}^{{\rm{A}}}=\,1\) for allele A and \({c}_{s,0}^{{\rm{B}}}=1\) for allele B. Each edge of the tree \(({v}_{i},{v}_{j})\in E\) is thus labelled with events that are required to transform the copy numbers of clone \(i\) to those of clone \(j\). In particular, similarly to previous copy-number evolutionary models14,15,17, we model each copy-number event for each segment independently as an event that either increases or decreases a copy number by one unit. As such, the label of each edge corresponds to \({| {c}_{s,i}^{{\rm{A}}}-{c}_{s,j}^{{\rm{A}}}| }_{1}\,+\) \({| {c}_{s,i}^{{\rm{B}}}-{c}_{s,j}^{{\rm{B}}}| }_{1}\) events for each genomic segment \(s\), as \({| {c}_{s,i}^{{\rm{A}}}-{c}_{s,j}^{{\rm{A}}}| }_{1}\) events are required for allele A of each segment \(s\) and \({| {c}_{s,i}^{{\rm{B}}}-{c}_{s,j}^{{\rm{B}}}| }_{1}\) for allele B.
Given a fixed topology for a tree \(T\), an arbitrarily high number of copy-number labellings for \(T\) can be defined in general to explain SCNA evolution of the \(n\) tumour clones. Therefore, we introduce an evolutionary model based on three biologically realistic assumptions that define the best copy-number-labelled tree \(T\) to represent SCNA evolution. First, similarly to previous models of SCNA evolution4,31, we assume that LOH events are irreversible; that is, if all of the copies of one allele in any segment are lost, these cannot be regained by any descendant. Specifically, for any segment \(s\in \{1,\,…,{m}\}\) in any edge \(({v}_{i},{v}_{j})\in E\) it holds that \({c}_{s,i}^{{\rm{A}}}=0\,\Rightarrow {c}_{s,j}^{{\rm{A}}}=\,0\) and \({c}_{s,i}^{{\rm{B}}}=0\,\Rightarrow {c}_{s,j}^{{\rm{B}}}=\,0\). Second, we limit the changes between gains and losses of the same genomic segment that occur along each evolutionary path to prevent overfitting (by default, the upper bound is 2; that is, only one type of change (negative change (loss) or positive change (gain)) can happen multiple times on a single path), whereby a path refers to the sequence of edges \(\{({v}_{0},{v}_{1}),\,…,({v}_{i-1},{v}_{i})\}\) connecting the diploid root clone \({v}_{0}\) to any node \({v}_{i}\). Last, under the assumption of parsimony, we define the tree with the least number of copy-number events that explain the input data to be the more likely explanation of SCNA evolution, similarly to previous approaches4,31.
Model of the sequencing experiment
Each bulk tumour sample comprises an unknown mixture of the \(n\) tumour clones. When sequencing \(k\) bulk tumour samples, we thus represent the proportion of each clone \(i\in \{0,\,…,{n}\}\) in each tumour sample \(r\in \{1,\,…,{k}\}\) as \({u}_{i,r}\in [0,\,1]\). Specifically, \({u}_{i,r} > 0\) indicates that clone \(i\) is present in sample \(r\), and absent otherwise. We assume that, within each tumour sample \(r\), the tumour purity (that is, the proportion of all tumour clones) is \({\sum }_{i=1}^{n}{u}_{i,r}=1-{u}_{0,r}\), in which \({u}_{0,r}\) represents the proportion of normal cells in sample \(r\).
We cannot directly observe the copy numbers \({c}_{s,i}^{{\rm{A}}}\) and \({c}_{s,i}^{{\rm{B}}}\) of each tumour clone \(i\) in each tumour sample from bulk DNA-sequencing data. However, existing methods28,29,30,45,46 can provide estimates of the allele-specific, fractional copy numbers \({\widehat{f}}_{s,r}^{{\rm{A}}}\), \({\widehat{f}}_{s,r}^{{\rm{B}}}\) of each segment \(s\) in each tumour sample \(r\), such that \({\widehat{f}}_{s,r}^{{\rm{A}}},{\widehat{f}}_{s,r}^{{\rm{B}}}\in {{\mathbb{R}}}^{\ge 0}\) are real-valued observed estimates of the average copy number of all cancer cells in the sample \({f}_{s,r}^{{\rm{A}}}=\,{\sum }_{i=1}^{n}{c}_{s,i}^{{\rm{A}}}{u}_{i,r}\) and \({f}_{s,r}^{{\rm{B}}}=\,{\sum }_{i=1}^{n}{c}_{s,i}^{{\rm{B}}}{u}_{i,r}\), respectively. Using existing methods, we can also observe estimates of the upper bound, \({f}_{s,r}^{{\rm{A}},+},\,{f}_{s,r}^{{\rm{B}},+}\in {{\mathbb{R}}}^{\ge 0}\), and lower bound, \({f}_{s,r}^{{\rm{A}},-},\,{f}_{s,r}^{{\rm{B}},-}\in {{\mathbb{R}}}^{\ge 0}\), of \({f}_{s,r}^{{\rm{A}}}\) and \({f}_{s,r}^{{\rm{B}}}\), for each sample \(r\) and genomic segment \(s\). The fractional copy number estimates, \({\widehat{f}}_{s,r}^{{\rm{A}}}\), \({\widehat{f}}_{s,r}^{{\rm{B}}}\), and their associated upper and lower bound estimates \([{f}_{s,r}^{{\rm{A}},-},{f}_{s,r}^{{\rm{A}},+}]\) and \([{f}_{s,r}^{{\rm{B}},-},{f}_{s,r}^{{\rm{B}},+}]\) represent measurements obtained from standard DNA-sequencing experiments.
Problem formulation
Hence, given allele-specific estimates of fractional copy numbers from multiple bulk DNA-sequenced tumour samples, ALPACA aims to infer the integer clone copy-number values \({c}_{s,i}^{{\rm{A}}}\) and \({c}_{s,i}^{{\rm{B}}}\) that minimize the objective functions above, similarly to the clone copy-number mixture deconvolution problem explored in previous work2,3,15,16,19,31. In contrast to existing methods, ALPACA does not however aim to jointly infer the number of clones present \(n\), and their clone proportions \({u}_{i,r}\). Instead, as described above, the hypothesis underlying ALPACA is that copy-number inference from fractional copy numbers can be improved by integrating other data that can be inferred by existing methods, in addition to the estimated fractional copy numbers: the topology of the tree \(T\) defined by the related edges \(E\), and the proportion \({u}_{i,r}\) of each clone \(i\) in each sample \(r\). Using fixed clone proportions is expected to improve the performance of copy-number inference because it simplifies the complex copy-number deconvolution problem into an easier linear optimization problem.
In fact, the tree, \(T\), that is taken as input to ALPACA can be derived from standard approaches for phylogenetic reconstruction from SNVs21,22,23,24,26,27. Similarly, clone proportions of the tumour clones \({u}_{1,r},\,…,\,{u}_{n,r}\) in sample \(r\) can be derived from standard approaches for tumour clonal reconstruction using SNVs21,22,23,24,25,26,47,48,49, and we assume that the tumour purity of each sample \(r\) (that is, \((1-{u}_{0,r})\)) can be estimated from standard approaches for tumour deconvolution from bulk DNA sequencing (for example, ASCAT28 and Sequenza29).
Overall, ALPACA thus aims to infer the unknown copy numbers of each clone in a tree \(T\) with a given topology (that is, given the set of edges \(E\)) and given the fractional copy numbers in each sequenced tumour sample. Formally, ALPACA is framed as a linear optimization algorithm, aiming to find the copy numbers \({c}_{s,i}^{{\rm{A}}}\) and \({c}_{s,i}^{{\rm{B}}}\) for all clones \(i\in \{1,\,…,{n}\}\) in all segments \(s\in \{1,\,…,{m}\}\) that satisfy the following model objectives:
-
1.
Given the observed, allele-specific fractional copy number estimates in each sample \({\widehat{f}}_{s,r}^{{\rm{A}}},{\widehat{f}}_{s,r}^{{\rm{B}}}\), their upper and lower bound estimates \([\,{f}_{s,r}^{{\rm{A}},-},{f}_{s,r}^{{\rm{A}},+}\,]\) and \([\,{f}_{s,r}^{{\rm{B}},-},{f}_{s,r}^{{\rm{B}},+}\,]\), the clone proportions in each sample \({u}_{i,r}\), and constraints imposed by the phylogenetic tree \(T\), find \({c}_{s,i}^{{\rm{A}}},{c}_{s,i}^{{\rm{B}}}\in {\mathbb{N}}\) that minimize the number of samples for which \({\sum }_{i=1}^{n}{c}_{s,i}^{{\rm{A}}}{u}_{i,r}\,\notin \) \([\,{f}_{s,r}^{{\rm{A}},-}\,,{f}_{s,r}^{{\rm{A}},+}\,]\) or \({\sum }_{i=1}^{n}{c}_{s,i}^{{\rm{B}}}{u}_{i,r}\notin [\,{f}_{s,r}^{{\rm{B}},-},{f}_{s,r}^{{\rm{B}},+}\,]\).
-
2.
Among these solutions, find \({c}_{s,i}^{{\rm{A}}}\) and \({c}_{s,i}^{{\rm{B}}}\) that minimize the absolute difference (Manhattan distance) between the predicted and expected fractional copy number, \({| {\widehat{f}}_{s,r}^{{\rm{A}}}-{\sum }_{i=1}^{n}{c}_{s,i}^{{\rm{A}}}{u}_{i,r}| }_{1}\) and \({| {\widehat{f}}_{s,r}^{{\rm{A}}}-{\sum }_{i=1}^{n}{c}_{s,i}^{{\rm{B}}}{u}_{i,r}| }_{1}\).
ALPACA solves this optimization problem using an algorithm based on a mixed-integer linear program (MILP implemented with gurobipy, v11.0.1) and a model selection approach defined in the following section and in the Supplementary Information.
Inferring clone-specific copy numbers
The ALPACA algorithm solves the problem stated above using a MILP and a model selection approach2,3,15,31. Specifically, given a fixed number of copy-number events \({\lambda }_{s}\in {\mathbb{N}}\), ALPACA identifies the copy numbers that minimize the above model objectives 1 and 2 under the previously described constraints, using an MILP formulation (details are reported in Supplementary Information). That is, for each segment \(s\), we aim to find \({c}_{s,i}^{{\rm{A}}}\) and \({c}_{s,i}^{{\rm{B}}}\) that minimize the objective functions subject to constraints described above, such that the total number of copy-number events across all edges \(({v}_{i},{v}_{j})\) of the tree \(T\) (including both gains and losses), and both alleles is less than \({\lambda }_{s}\). ALPACA performs this optimization for each segment independently, as the evolution of each segment is modelled independently for the fixed tree topology. This allows ALPACA to scale to tumours with a high burden of subclonal SCNAs and thus a high number of segments at once.
The number of copy-number events \({\lambda }_{s}\) that occur throughout evolution is unknown. Therefore, similarly to previous studies4,31, we apply a model selection approach to find \({\lambda }_{s}\) while finding the best compromise between solutions with higher complexity (higher values of \({\lambda }_{s}\)) and those with better fit of the data (lower values of the objective function). Specifically, we consider an increasing number of copy-number events \({\lambda }_{s}\), and we select the best number of events, \(\hat{{\lambda }_{s}}\), using the elbow approach implemented by the Kneedle algorithm (kneed v0.8.5)50. Further details on the model selection algorithm used by ALPACA are available in the Supplementary Information.
The ALPACA algorithm outputs inferred clone copy numbers \({c}_{s,i}^{{\rm{A}}}\) and \({c}_{s,i}^{{\rm{B}}}\) for each segment \(s\in \{1,\,…,{m}\}\) within every ancestral and observed tumour clone in the \(k\) sequenced samples.
Run time
In the experimental TRACERx421 primary cohort, the total CPU time for a single tumour ranged from 48 s to 36 h with a median run time of 19 min. The differences in run time across tumours can be partially explained by the number of segments (ALPACA iterates over each segment independently) and by the number of SNV clones detected in a tumour; mean purity and ploidy of tumours show only a very weak correlation with the CPU run time (Extended Data Fig. 2). Iterating over genomic segments independently allows for a very efficient parallelization, in which segments predicted to complete quickly can be grouped into large batches, and segments predicted to complete slowly can be grouped into small ones.
Benchmarking ALPACA on simulated data
MASCoTE simulations
Creating input
We used the previously published simulation framework MASCoTE2 to simulate multi-sample bulk sequencing, using the following steps.
First, to create the input required by ALPACA, we mapped the 50-kilobase binned simulated data to the ground-truth segments, represented by \(S\). Second, we calculated the average read depth \(R\) of each segment \(s\in \{1,\,…,{|S|}\}\) and each sample \(r\in \{1,\,…,{k}\}\), \({R}_{s,{r}}\), by averaging \(R\) across all bins \(b\in B\) overlapping with a segment \(s\):
\({R}_{s,{r}}=\frac{1}{|{B}_{s}|}\sum _{b\in {B}_{s}}{R}_{b}\), in which \(|{B}_{s}|\) is the number of bins overlapping with segment \(s\), and \({R}_{b}\) is the read depth of bin \(b\). Using the same method, we then calculated the BAF \(B\) for each segment \(s\) and each sample \(r\), using the following equation:
$${B}_{s,r}=\frac{1}{|{B}_{s}|}\sum _{b\in {B}_{s}}{B}_{b}$$
For each segment \(s\) and each sample \(r\), we then determined the single nucleotide polymorphism (SNP) confidence intervals by fitting a t-distribution to the read depth values within each segment and sample. To simulate the experimental noise, we first calculated the true fractional copy numbers \({F}_{T}\) for each sample \(r\) and segment \(s\) by taking the dot product of two vectors: the vector representing the clone proportions in the sample \({{\bf{U}}}_{{\bf{r}}}\) and the vector of integer copy numbers of the clones present in the sample, \({\bf{C}}\):
$${F}_{T}={{\bf{U}}}_{{\bf{r}}}\cdot {\bf{C}}$$
Using the true fractional copy numbers \({F}_{T,r}\) and read depth \({R}_{r}\) for each sample \(r\), we derived a scaling factor for each segment and each sample, \({\varGamma }_{S,R}\), by dividing \({F}_{T,r}\) by the \({R}_{r}\) values for each segment \(s\) and each sample \(r\). By taking a mean of \({\varGamma }_{s,r}\) across all segments for each sample, we obtained the mean scaling factor per sample \(\bar{{\varGamma }_{r}}\):
$${\varGamma }_{s,r}=\frac{{F}_{T,s,r}}{{R}_{s,r}}$$
$$\bar{{\varGamma }_{r}}=\frac{1}{|S|}\sum _{s\in S}{\varGamma }_{s,r}$$
By multiplying the mean scaling factor per sample, \({\varGamma }_{r}\), by \(R\) of each segment in each sample, we obtained the final total fractional copy numbers, \(F\), for each segment in each sample:
$${F}_{s,r}=\bar{{\varGamma }_{r}}\times {R}_{s,r}$$
Last, using previously calculated \(B\), we obtained fractional copy numbers for alleles A and B (\({F}^{{\rm{A}}}\) and \({F}^{{\rm{B}}}\)) for each segment and each sample:
$${F}_{s,r}^{{\rm{A}}}={F}_{s,r}\times ({1-B}_{s,r})$$
$${F}_{s,r}^{{\rm{B}}}={F}_{s,r}\times {B}_{s,r}$$
Accuracy calculation
To compare ALPACA, HATCHet, HATCHet2 and cloneHD results using MASCoTE simulations, we used the accuracy metric adapted from a similar metric used in the HATCHet2 publication3. Let \({S}_{s}\) signify the set of allele-specific, integer copy-number states inferred by a model for a specific genomic segment \(s\). Let \({T}_{s}\) signify the set of true integer copy-number states for segment \(s\). Let \({l}_{s}\) signify the length of segment \(s\) (expressed in base pairs), and let \(L=\sum _{s}{l}_{s}\) represent the total length of genomic segments under consideration. The accuracy of a specific model can then be represented as
$${\rm{Accuracy}}=\frac{1}{L}{\Sigma }_{s}{l}_{s}\frac{| {S}_{s}\bigcap {T}_{s}| }{| {S}_{s}\bigcup {T}_{s}| }$$
Simple model
Tx simulations in simple model
To recapitulate the clonal and SCNA landscape of NSCLC, we used the previously described simulated cohort5,21, which provided us with phylogenetic trees, copy-number events across the genome and clone proportions in each tumour sample. To obtain the ALPACA input using these data, we calculated the true fractional copy number for each simulated genomic segment and matched it with a real segment from the TRACERx421 primary cohort5 (randomly selected from a pool of segments with similar length and fractional copy number: tolerance ±10%). To obtain confidence intervals of these fractional copy-number values per segment, we assigned SNPs of the true segment to the simulated segment, and corrected the copy numbers of the assigned SNPs by the difference between the fractional copy-number states of the simulated segment and the true matched segment.
Implementation
The simple model was implemented similarly to the method described previously5. In brief, for each segment, clone and allele, the clone-specific copy number was computed as a rounded fractional copy number (to the nearest integer) of a sample in which a given clone was most prevalent, on the basis of the difference in cancer cell fraction values between samples in which the event is and is not present. For clones that have no representation in any sample, copy number was inherited from the closest ancestor. If no ancestors were available for a clone, this clone’s copy number was assigned to take the copy-number states of its children.
Matching comparison between methods
Creating input
To evaluate the ability of ALPACA to accurately infer the evolution of SCNAs, we extended the previously created realistic simulated instances that modelled SNV evolution (Tx simulations)5,21. This simulated dataset contained 2,704 clones and 87,393 genomic segments, amounting to 3,179,926 clone- and allele-specific segments. These simulated instances modelled not only truncal and subclonal SNVs, but also the effect of multiple other genomic alterations such as truncal and subclonal SCNAs and whole-genome duplication on SNV evolution, enabling their use to evaluate the reconstruction of SCNA evolution as well as SNV evolution. Additionally, the simulations were developed to reflect real non-small cell lung cancer phylogenies by using distributions of statistics measured from sequencing data for the TRACERx421 cohort5, such as the number of samples per simulation, the frequency of SCNAs and the proportion of events occurring truncally or subclonally. We used these simulations to evaluate the performance of phylogeny reconstruction using CONIPHER21 and ALPACA, and compared the results to the existing state-of-the-art copy-number reconstruction tools HATCHet2 (ref. 3) and MEDICC2 (ref. 4).
To reconstruct copy-number evolution, each method required sequencing information across the entire genome. However, as the Tx simulation framework was developed to simulate SNV evolution, the instances contained simulated sequencing data for only select regions of the genome (those affected by simulated SNVs). We therefore located the positions across the simulated genome with copy-number information and defined the size of the copy number alterations to cover between 50 and 100 mb, resulting in a simulated genome of 1-mb bins, for which each bin is denoted by \(i\). Any bins containing more than one simulated SNV were split into multiple smaller bins. For each sample \(s\), the ploidy \({\rho }_{s}\) and purity \({\mu }_{s}\) values from the Tx simulated sequencing data were used.
The sequencing depth of each bin, or the observed total number of reads \({t}_{i,s}\), was modelled using a Poisson distribution with the mean equal to the expected total number of reads present in each bin \(i\) for all cells in each sample \(s\). The expected total number of reads can be calculated from the true fractional copy number \({f}_{i,s}\), in which \({f}_{i,s}\) corresponds to the average total copy number of each bin \(i\) for all cells present in sample \(s\). The values of \({f}_{i,s}\), as well as the expected allele-specific fractional copy number for allele A \({x}_{i,s}\) and allele B \({y}_{i,s}\), for bins containing a SCNA were populated using the respective true fractional copy number values calculated in the Tx simulations, whereas the values for any bins without a SCNA were populated with either a diploid or tetraploid copy-number value depending on the most common value in the simulated sequencing data for each sample. The expected total number of reads is equal to \(\frac{{f}_{i,s}}{{\rho }_{s}}{\gamma }_{s}\), in which \({\gamma }_{s}\) represents the expected average coverage per sample. \({\gamma }_{s}\) was set to a value of 1,000 for all samples \(s\) in all simulated instances. Therefore, we simulated \({t}_{i,s}\) as drawn from a Poisson distribution with the mean equal to the expected number of total reads \({t}_{i,s}\, \sim {\rm{Poisson}}\left(\frac{{f}_{i,s}}{{\rho }_{s}}{\gamma }_{s}\right)\).
We next used a binomial model to simulate the observed B-allele frequency (BAF) of each bin. We defined the true BAF \({\beta }_{i,s}\) for each bin \(i\) and sample \(s\) on the basis of the fractional copy number of both alleles \({f}_{i,s}\) and allele B \({y}_{i,s}\) as \({\beta }_{i,s}\,=\,\frac{{y}_{i,s}}{{f}_{i,s}}\). We then modelled the observed BAF \({\hat{\beta }}_{i,s}\) of each bin using the binomial distribution with the probability of success equal to \({\beta }_{i,s}\) and the number of trials equal to a simulated coverage value \({\gamma }_{s}\) of 1,000, such that \({\widehat{\beta }}_{i,s}\, \sim \,{\rm{binomial}}({\gamma }_{s},{\beta }_{i,s})\).
The observed fractional copy-number values for each bin \(i\) and sample \(s\) were then calculated for allele A \({\hat{x}}_{i,s}\) and allele B \({\hat{y}}_{i,s}\) on the basis of the observed total number of reads present in each bin \({t}_{i,s}\), the ploidy of the sample \({\rho }_{s}\), and the observed BAF \({\hat{\beta }}_{i,s}\) using the following equations for allele A and allele B, respectively: \({\hat{x}}_{i,s}\,={t}_{i,s}\times {\rho }_{s}\times (1-{\hat{\beta }}_{i,s})\), \({\hat{y}}_{i,s}\,={t}_{i,s}\times {\rho }_{s}\times {\hat{\beta }}_{i,s}\).
Parameters used for HATCHet2 and MEDICC2
Using our set of ground-truth simulations (Tx simulations), we compared the performance of CONIPHER + ALPACA to HATCHet2 + MEDICC2 to accurately infer copy-number evolution. HATCHet2 (v2.0.1) was run with default parameters, except for an increase in the possible number of clones identified. A maximum value of 12 clones was used for simulations from LTXSIM001 to LTXSIM050, and a maximum value of 18 clones was used for simulations from LTXSIM051 to LTXSIM150, owing to the increased number of clones in the ground-truth instances. MEDICC2 (v2.0.4) was run with default parameters. For one simulated case (LTXSIM063), HATCHet2 did not complete within the 72 h time limit, and we excluded this case from the comparison.
Comparison metrics
HD matching clone comparison
To evaluate model performance using HD, we followed the following procedure. For each simulated clone, we calculated the HD between its ground-truth copy-number profile and the copy-number profiles of all clones predicted by the model. This allowed us to identify the most similar predicted clone for each true clone. After assigning a matching predicted clone and its corresponding HD to each true clone, we computed the mean of these distances. This final mean value represents how accurately the model retrieved the true copy-number states.
TVD
To evaluate model performance using TVD, we considered the output of each method (that is, copy-number states for each genomic segment and clone proportions associated with each such state) as probability distribution over discrete copy-number states and calculated the TVD between true and inferred distributions for each model. Owing to the high number of genomic segments in Tx simulations, TVD in Fig. 2c is averaged per patient before plotting.
Single-cell data processing
To evaluate the performance of ALPACA on real tumour data, we harnessed single-cell data acquired from 10x Genomics sequencing of a breast tumour and used CHISEL35 to estimate the copy number, reconstruct clonal architecture of the samples, generate ground-truth copy numbers and generate input data required by ALPACA. This dataset contained five samples (A–E). Sample A contained only normal cells and was excluded from this analysis. On the basis of the known number of cells within each clone, we calculated clone proportions in each sample and created a ground-truth reference for all clones excluding the MRCA, for which ground-truth data were not available. Next, we created confidence intervals for each segment and each sample by fitting the normal distribution to each segment’s fractional copy-number state with mean 0 and standard deviation of 0.1 to simulate noise. These data, together with the phylogenetic tree obtained from the original study35, were used as input to ALPACA. In this dataset, two clones dominate the tumour (JII and JIV), with the largest other clone corresponding to only approximately 4% of the sampled tumour (average across all samples, Extended Data Fig. 3e), and thus probably below the threshold of SNV clone detection using bulk sequencing using standard approaches5. Therefore, we excluded these small clones from the comparison.
Identifying regions affected by extensive copy-number heterogeneity
To identify regions affected by extensive SCNA events, we calculated the coefficient of variation for each allele’s fractional copy number across all samples within each genomic segment. We then averaged these values to obtain the mean coefficient of variation for each segment. Next, we ranked the genomic segments within each tumour on the basis of their mean coefficient of variation and created four cohorts by progressively filtering out the top 5%, 15%, 25% or 50% of the genome with the highest variation.
Mutation clustering concordance
We calculated clustering concordance scores for each pair of mutations, by checking whether mutations in both sets of results retained the same phylogenetic relationships in the default CONIPHER output compared to a cohort with 5%, 15%, 25% or 50% of the genome with the highest copy number variation removed. We classified each pair of mutations as: belonging to the same cluster; ancestor–descendant; descendant–ancestor; or parallel (not belonging to any other class). We scored each pair as concordant if the class of phylogenetic relationship was found to be the same in the full cohort and filtered cohort. The final score represents the fraction of concordant pairs out of all the possible pairs
Preprocessing real data for input to ALPACA
Cohort
All biological analyses in Figs. 3–6 and Extended Data Figs. 1d, 2 and 5–10) were performed on the processed multi-sample bulk WES data from the recent longitudinal, prospective cohort TRACERx421 (refs. 5,6). For each tumour, the eligibility criteria for input to ALPACA included: allele-specific copy number computed for at least two primary tumour samples; a tumour phylogenetic tree computed from SNVs present across these same tumour samples, comprising at least two tumour clones (that is, at least an MRCA clone and one subclonal tree edge); and computed clone proportions of each clone on the phylogenetic tree in each sample. This resulted in n = 395 primary tumours5 and n = 126 paired primary–metastatic tumours with all of the correct inputs required for ALPACA.
Obtaining allele-specific fractional copy numbers per bulk tumour sample
The fractional copy numbers were further processed in the following way. We first identified clonal segments, by performing a statistical test to identify whether the fractional, allele-specific copy numbers were significantly different from their closest integer (t-test, P-value threshold of 0.0001). Segments with values significantly different from the nearest integers were considered subclonal. We further tested for evidence of allelic imbalance, by performing a statistical test to identify whether the fractional, allele-specific copy numbers for both alleles were different from one another in at least one sample (t-test, P-value threshold of 0.001). For each subclonal segment without evidence for allelic imbalance in any sample, we recalculated their total fractional copy number by summing the original fractional copy numbers for both alleles and assigning an integer copy number (closest integer to the original fractional copy number of respective allele) to allele B if the original fractional copy numbers for both alleles were larger than the closest integers for both alleles, or to allele A if they were less than the closest integer. The remaining allele (A or B, respectively) was assigned a fractional copy number calculated as the sum of its nearest integer and total fractional copy number minus the sum of closest integers for both original fractional copy numbers.
Correcting erroneous homozygous deletions
To prevent small clones being predicted as having copy number zero in both alleles for long stretches of the genome, the ALPACA method additionally implements constraints on the permitted number of homozygous deletions. Specifically, we introduce a threshold \({h}=1\) within each genomic segment \(s\in \{1,\,…,{m}\}\), such that a clone can have a homozygous deletion only if: the clone or its descendants are present in a tumour sample that has a fractional copy number of less than the threshold \(h\) in both alleles; and the genomic segment is shorter than 50 MB.
Obtaining a tumour phylogenetic tree
To obtain a single tumour phylogenetic tree per patient case in the TRACERx study, we used the published phylogenetic trees from the recent primary5 and primary–metastasis6 TRACERx421 cohorts. These tumour phylogenies were constructed from DNA-sequencing data using CONIPHER21, as described previously5. For tumour cases with multiple possible tumour phylogenetic trees, the tree with the lowest error was selected21.
Obtaining tumour clone proportions
For each tumour case, clone proportions of each tumour clone in each tumour sample were computed from the phylogenetic tree and mutation-cluster mean phylogenetic cancer cell fraction (PhyloCCF) values using the R function compute_subclone_proportions from the CONIPHER R package (v2.2.0)21. The input parameter value force_clonal_100 was set to be TRUE, to force clones whose distribution of associated mutation PhyloCCF values was not statistically distinct from that of the truncal cluster PhyloCCF values in certain tumour samples to have mean PhyloCCF = 1 in these samples. This ensured that when computing the clone proportions the clonal diversity was reduced and prevented noisy cancer cell fraction estimates resulting in samples harbouring clones with very small clone proportion.
Computing clone ploidy
We used ALPACA to compute the clone ploidy for every clone from each tumour run through ALPACA in the TRACERx421 primary and primary–metastasis cohorts5,6. Clone ploidy was defined as follows: considering clone \(i\) with total copy number \({c}_{s,i}={c}_{s,i}^{{\rm{A}}}+{c}_{s,i}^{{\rm{B}}}\) in genomic segment \(s\), with segment size \({w}_{s}\), then the clone ploidy, \({p}_{i}\), was computed as the average clone copy number across all genomic segments, weighted by segment size:
$${p}_{i}=\frac{{\sum }_{s=1}^{m}{w}_{s}{c}_{s,i}}{{\sum }_{s=1}^{m}{w}_{s}}$$
WGD detection using ALPACA
To identify WGD events, we compared copy-number states across the genome between each clone and its parent and grandparent. If the mean ratio, weighted by segment length, between the child clone and the parent clone was above 1.5, we classified the child clone as genome-doubled. If this ratio was below 1.5, but the ratio between the child and the grandparent was above 1.5 and the parent clone was not already classified as genome-doubled, we also classified the child clone as genome-doubled.
Defining copy-number events on edges of the phylogenetic tree
Defining segment-specific copy-number events
For each tumour case and each genomic segment, we inferred the allele-specific copy number change on each edge of the phylogenetic tree on the basis of ALPACA’s output. Next, we applied the following criteria to classify whether a copy-number event had occurred on each edge. Considering an edge of the phylogenetic tree \(({v}_{i},{v}_{j})\in E(T)\), and the corresponding allele-specific clone copy number of the parent clone \({{c}^{{\rm{A}}}}_{i}\), \({{c}^{{\rm{B}}}}_{i}\) and the child clone \({{c}^{{\rm{A}}}}_{j}\), \({{c}^{{\rm{B}}}}_{j}\), we define the following criteria for copy-number events, in which \({p}_{i}\) represents the clone ploidy of clone \(i\):
Event name: criteria:
Copy-number gain: \({{c}^{{\rm{A}}}}_{j} > \,{{c}^{{\rm{A}}}}_{i}\) or \({{c}^{{\rm{B}}}}_{j} > {{c}^{{\rm{B}}}}_{i}\)
Copy-number loss: \({{c}^{{\rm{A}}}}_{j} < \,{{c}^{{\rm{A}}}}_{i}\) or \({{c}^{{\rm{B}}}}_{j} < {{c}^{{\rm{B}}}}_{i}\)
LOH: \({{c}^{{\rm{A}}}}_{i}\,\ne \,0\,\& \,{{c}^{{\rm{A}}}}_{j}\,=\,0\) or \({{c}^{{\rm{B}}}}_{i}\,\ne \,0\,\& \,{{c}^{{\rm{B}}}}_{j}\,=\,0\,\)
Amplification: \({{c}^{{\rm{A}}}}_{j} > 2\times \,{{c}^{{\rm{A}}}}_{i}\,\& \,{{c}^{{\rm{A}}}}_{j}\ge 4\,\& \,{{c}^{{\rm{A}}}}_{j} > {p}_{j}\)
or \({{c}^{{\rm{B}}}}_{j} > 2\times \,{{c}^{{\rm{B}}}}_{i}\,\,{\rm{\& }}\,{{c}^{{\rm{B}}}}_{j}\ge 4\,\& \,{{c}^{{\rm{B}}}}_{j} > {p}_{j}\)
Defining chromosome-arm-level copy-number events
For each edge of the phylogenetic tree, we considered a whole chromosome arm to be affected by a copy-number event on the basis of the following classification, based on previous sample-level chromosome-arm copy-number calling analysis10:
Chromosome-arm event name: criteria
Copy-number gain: 90% of the chromosome arm is affected by gain, for either allele
Copy-number loss: 90% of the chromosome arm is affected by loss, for either allele
LOH: \(\ge 90 \% \) of the chromosome arm is at copy number 0 in the child clone, but \(\le 20 \% \) of the chromosome arm is at copy number 0 in the parent clone, for either allele
Amplification: \(\ge 90 \% \) of the chromosome arm is amplified as per the criteria described above, for either allele.
Defining chromosome-arm-level copy-number events detectable with sample-level analysis
For each sample, we called an arm-level LOH event if at least 0.98 of the analysed genome had a fractional copy-number state below 0.5 for either allele.
Event-ordering analysis
Identifying and assigning top frequently altered driver events to edges of the phylogenetic tree
To select driver SNVs and relevant copy-number events to include in the Bradley–Terry model for ordering driver events, we first classified the genomic driver events for each tumour case in the TRACERx421 primary and primary–metastasis cohorts5,6. These driver events included chromosome-arm amplifications and LOH events, focal (gene-level) amplifications and SNV driver events. We considered amplifications affecting oncogenes on the basis of a list collated from known copy-number drivers and known pan-cancer drivers as described in a previous study5. Oncogenes were considered to be amplified if the gene locus was completely contained within a copy-number segment called as amplified on an edge of the phylogenetic tree (using criteria described above). If a focal amplification was overlapping with a chromosome-arm amplification on an edge of the phylogenetic tree, only the chromosome-arm-level event was considered to have affected that edge of the tree. Then, including both clonal and subclonal events, we quantified the frequency of the number of tumours in which any such event was called. We selected the following events that were identified most frequently across the cohort to include in the ordering analysis, for LUAD and LUSC separately:
Event ID: number of top most frequent events selected for ordering analysis:
Chromosome-arm amplifications: 10
Chromosome-arm LOH events: 10
Subclonal chromosome-arm amplifications: 5
Subclonal chromosome-arm LOH events: 5
Focal amplifications (genes not part of an arm event): 10
Subclonal focal amplifications (genes not part of an arm event): 5
All SNVs classified as driver mutations as described in a previous study26: 10
Subclonal SNVs classified as driver mutations as described in a previous study26: 5
From this list of the top most frequent alterations, in LUAD and LUSC separately, we further subset to the events that were called as present in at least five tumours.
Specifying the event-ordering model
On the basis of these most frequent alterations, for LUAD and LUSC separately, we counted the number of tumours harbouring an alteration X from the top frequent alteration list occurring on an ancestral node, followed by any descendant node on the tree harbouring alteration Y from the top frequent alteration list (as depicted in Fig. 4a). If multiple descendant clones harboured the same descendant alteration Y to ancestral alteration X, we counted the ancestor–descendant relationship only once per tumour. These pairwise counts were entered into a Bradley–Terry model including a bias-reduced estimate, which was implemented using the BTm and update functions from the R package BradleyTerry2 (v1.1-2)51. When running the ordering analysis for the TRACERx421 primary–metastasis cohort6, we additionally considered metastatic ‘seeding’ as an event that could affect an edge of the phylogenetic tree, and therefore the metastatic seeding was ordered in the same way as other top most frequent events. Clones were labelled as being a seeding clone on the basis of the criteria described previously6.
Expanded subclones
Expanded subclones were defined as non-MRCA tumour clones whose associated set of mutations seem clonal in at least one tumour sample5.
Clone classification
We classified clones into classes as follows (Fig. 5a):
-
1.
MRCA clones;
-
2.
shared clones (clones with mutations shared between the primary and metastatic samples);
-
3.
seeding clones (the most recent shared clones);
-
4.
primary specific (clones with mutations detected only in primary samples); and
-
5.
metastasis specific (clones with mutations detected only in metastatic samples).
Computing number of SCNAs on a tree edge
ALPACA infers copy-number events for each segment independently; however, it is known that SCNAs can overlap multiple genomic segments4,31,52,53. Using the ALPACA output, we hence aimed to compute a summary statistic, called the number of SCNAs on a tree edge, \(N\), to enumerate the actual number of copy-number events that occurred in the copy-number evolutionary tree inferred by ALPACA while taking into account neighbouring SCNAs. In brief, we computed \(N\) as follows. Considering each allele and each chromosome separately, we counted the number of contiguous copy-number segments with a copy-number gain event of the same amplitude, and the number of contiguous segments with a copy-number loss event of the same amplitude, between the parent and the child node. The total number of copy-number events on this tree edge was then computed as the total sum of gains and losses on the tree edge, across all chromosomes, across both alleles. We note that in contrast to previous studies4,52,53, this method does not model the most parsimonious number of interval events required to transform the copy-number profile of the parent clone to that of the child clone. A mathematical description of the specific method used to compute the number of gains, losses and overall number of SCNAs on each edge is available in Supplementary Information section 4.
Comparing rate of SNV and SCNA acquisition per tree edge
To compare the rate of SNVs and SCNAs acquisition on each tree edge, we considered the number of SNVs on each edge divided by the total sum of subclonal SNVs (that is, all mutations on subclonal edges of the tumour phylogeny) in the tumour across all tree edges, and compared this with the number of copy-number events (that is, number of SCNAs) on the same edge, divided by the total sum of subclonal copy-number events across all edges of the tumour. This analysis was run considering only subclones in both the TRACERx421 primary and primary–metastasis cohorts. Further, using the set of all subclones in the TRACERx421 primary–metastasis cohort, we fitted a linear mixed-effects model with the dependent variable being the normalized number of SCNAs, tumour as a random-effect explanatory variable, and the fixed-effect explanatory variables: normalized number of SNVs, histology and clone type (shared clones (reference level), seeding clones, primary clones or metastatic clones). Then, using the same set of all subclones in the TRACERx421 primary–metastasis cohort, we fitted a mixed-effects logistic regression model with the dependent binary variable of whether a clone was metastatic or not, tumour as a random-effect explanatory variable, and the fixed-effect explanatory variables number of SNVs, number of gains on the edge, number of losses on the edge and histology.
Comparing SCNAs in seeding versus non-seeding clones
To identify specific copy-number events occurring on evolutionary trajectories leading to metastatic seeding, we considered only a subset of tumours from the TRACERx421 primary–metastasis cohort that satisfied the following conditions: the tumours had exclusively subclonal seeding; and the tumours had at least one non-seeding trajectory, which is defined as the tree path connecting the MRCA clone to a non-seeding primary clone: a primary clone that has no ancestors that are seeding (n = 71; Extended Data Fig. 8a). This set of tumours thus had both of the following tree paths present: paths connecting the MRCA clone to a subclonal seeding clone, and paths connecting the MRCA clone to a subclonal non-seeding primary clone.
For each oncogene in our collated list5, for each tumour, we performed a binary classification of whether a copy-number increase has occurred on a trajectory between the (non-seeding) MRCA and any seeding subclone, and a binary classification of whether a copy-number increase has occurred on a trajectory between the non-seeding MRCA and any non-seeding subclone (a copy-number decrease is considered for TSGs, respectively, and the specific copy-number events we considered are listed below). Then, for each oncogene event, we counted the number of tumours for which an increase affected subclonal seeding trajectories and the number of tumours for which an increase affected subclonal non-seeding trajectories. There are a different number of seeding and non-seeding clones in each tumour, and indeed a larger number of non-seeding clones overall, for each oncogene event. Hence, we tested whether the counts of tumours with events in seeding and non-seeding trajectories differed from a background count of the total number of such events affecting seeding and non-seeding trajectories summed across all non-driver genes.
We categorized ‘increase’ copy-number events at oncogene loci into: gains, gains >1 copy, amplifications. For TSG loci, we categorized ‘decrease’ copy-number events into: losses, losses >1 copy, LOH and LOH 0|1 (which describes the clone copy-number state of one allele with copy number 0, and the remaining allele with 1 copy).
Calculating CCD
To calculate CCD for clones within the primary cohort, we first removed all of the clones exclusively found in the lymph nodes during the primary surgery, as these clones represent cells that have already migrated from the primary site. For the remaining clones, we concatenated the vectors of predicted clone copy-number states for alleles A and B along the genome for each clone, and computed the pairwise Euclidean distance of this vector between each pair of clones. Then, for each tumour case in the TRACERx421 primary cohort5, the maximum value distance between any two clones was chosen as the CCD score for that tumour.
Clinical data and survival analysis
The clinical data were preprocessed in the same way as described in previous work5. Multivariable Cox model and Kaplan–Meier analyses were performed as described in previous work5. Covariates — age, stage, pack-years, histology, adjuvant and fraction of the aberrant genome with subclonal SCNAs computed on a sample level (termed ‘SCNA-ITH’ in ref. 5) — were processed as described in ref. 5 and were included as covariates in the Cox proportional hazards model.
Statistical information
Statistical tests were performed using either R or Python (scipy v1.10.1).
All tests were two-sided, unless stated otherwise, and were carried out using paired or unpaired options as appropriate, unless stated otherwise. Hazard ratios and P values for the survival analysis were computed using the R Survival packages. For all plots involving statistical tests, all data points included are plotted or the number of data points included is specified in the figure legend.
When data is represented using box plots, the box represents the interquartile range with the median line. Whiskers denote the lowest and highest values within 1.5 times the interquartile range from the first and third quartiles, respectively.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Source link