Sample collection and sequencing
Our FF–WGS samples were comprised of processed data from previous sequencing experiments of our evolutionary predictions in CRC cohort, which has been described in refs. 15,23. All patients gave informed consent in writing for collection of their materials to the UCLH Cancer Biobank (Research Ethics Committee approval 15/YH/0311). All investigators were blinded to patient data related to outcome, and all clinicopathological information.
Our FFPE–PS samples originated from 11 stage III microsatellite-stable cancers with lymph node metastases from the same cohort, 9 of which are shared with FF–WGS samples. FFPE sections were cut in the following order: H&E-1 (5 μm), 5 × 8 μm for laser capture microdissection, H&E-2 (5 μm), 8 × 5 μm for CyCIF, H&E-3 (5 μm). The H&E slides from each FFPE block were digitized using the NanoZoomer S210 or S60 (Hamamatsu). Images were reviewed using NDPViewer software (v.2.9.29).
Regions of interest (ROIs) were identified as single glands or clusters of small adjacent glands (microbiopsies) from superficial, invasive margin or lymph node deposits. Superficial regions were defined as cancer regions adjacent to or contiguous with normal mucosa. The tumor–normal interface was identified where possible, and the invasive margin was defined as the region within 500 μm either side of the tumor–normal interface (with an overall extent of ~1 mm).
Panel sequencing
ROIs were microdissected using PALM MicroBeam Laser Microdissection (Zeiss). DNA was extracted using the High Pure FFPET DNA Isolation Kit (Roche). Extracted FFPE DNA was repaired using the NEBNext FFPE DNA Repair Mix. Postrepair, whole-genome libraries were prepared using the NEBNext Ultra II FS DNA Library Prep Kit for Illumina with unique molecular identifier (UMI) adapters ligated onto DNA molecules.
PS on FFPE samples was carried out using a custom targeted panel designed by L. Zapata and manufactured by Twist BioSciences, focusing on regions encoding the immunopeptidome and antigen-presenting related genes. The immunopeptidome was defined as the set of the human nine-mers that bind strongly to one of the top 70 HLA alleles, was confirmed T cell positive in IEDB and were derived from a gene with mean expression of more than one fragments per kilobase million pancancer. The final list of these immunopeptidome loci can be obtained from: https://github.com/luisgls/SOPRANO/blob/master/immunopeptidomes/human/allhlaBinders_exprmean1.IEDBpeps.unique.bed.
The Twist Target Enrichment Protocol was used to hybridize probes from the this custom panel with prepared libraries. Hybridized targets were isolated and amplified with PCR. Paired-end 50-bp runs were performed on Novaseq S1 (Illumina).
Processing of sequencing data
The processing of FF–WGS data was detailed in ref. 15.
For FFPE–PS samples, three sets of fastq reads were aligned to human reference genome build hg38 to generate an unmapped bam using FastqToBam (Fgbio v.1.3.0). Fastq files were then created from this unmapped bam (SamToFastq, Picard v.2.20.3) and aligned to human reference genome build GRCh38/hg38 with Burrows–Wheeler Aligner package BWA-MEM v.0.7.17. Alignment data from the outputted bam from this step were merged with data in the previously generated unmapped bam (using the MergeBamAlignment function of Picard v.2.20.3). The merged bam file was then used as input for GroupReadsByUMI (Fgbio v.1.3.0). ‘Adjacency’ was used as the strategy for grouping so that errors were allowed between UMIs, but only when there was a count gradient. The allowable number of edits/changes between UMIs was set to 1, and minimum mapping quality to 30 (default).
Consensus sequences were called from reads with the same unique molecular tag (CallMolecularConsensusReads) and filtered using FilterConsensusReads to exclude consensus sequences with fewer than two contributing reads, mask consensus bases with quality less than 30, accept a maximum raw-read error rate across the entire consensus read of 0.05, accept a maximum error rate for a single consensus base of 0.1 and accept a maximum fraction of 0.2 for no-calls in the read after filtering.
The filtered reads were then re-aligned to GRCh38/hg38, and variant calling was performed using Mutect2 (v.4.1.4.1) with a bed file specifying regions of the genome covered by the targeted panel. Normal samples from adjacent muscle were used as matched normal for variant calling. The resulting vcf files were merged for each patient and passed to Platypus (v.0.8.1.1) for multiregion variant calling.
Variant calls were filtered to retain mutations within certain filtering criteria and adequate support for the variant, as described in ref. 15. The same filters were used for both FF–WGS and FFPE–PS samples, except for requiring at least five and at least eight reads covering each site in FF–WGS and FFPE–PS samples, respectively.
FF–WGS samples sequenced at shallow depth were genotyped and fitted on phylogenetic trees using the method described in ref. 15.
Raw RNA-seq read counts were normalized and converted using DESeq2 as described in ref. 15. The code for reproducing RNA-seq processing can be found at https://github.com/JacobHouseham/EPICC_transcriptome. Raw counts were converted either to transcript per million (TPM) values or processed counts following VST, see ‘2.gene_expression_normalisation_and_filtering.Rmd’ within the above repository. TPM values were used to assess expression levels of genes (for example, to establish constitutively expressed genes), while VST values were used for between-sample comparison of a particular gene.
HLA haplotyping
HLA-A, -B and -C haplotyping was performed using polysolver47 (v.1.0) on FF–WGS samples, by running shell_call_hla_type with default settings. As ethnicity information was not available, we used ‘Unknown’ for all samples. To increase coverage, we used merged bam files created from all sequencing files from a given cancer. For validation, we also performed haplotyping on merged bams formed of normal (blood or normal colon tissue) samples and compared the predicted haplotypes. The predicted haplotypes had a high concordance, with haplotypes predicted using all samples providing one more heterozygous haplotype than normal-only haplotypes in 6 of 30 cases. Based on the average homozygosity across CRCs (as seen in TCGA CRC samples8), we accepted the more heterozygous set of alleles predicted to define their set of HLA alleles. For all HLA alterations, we confirmed that the alterations are called independent of which haplotyping calls were used.
For FFPE–PS samples we used the calls derived from matched FF–WGS samples. For the two patients where this was not available, we performed haplotyping on adjacent normal mucosal samples following the same steps as described above.
Immune escape prediction
Mutations in HLA
Somatic mutations in the HLA locus were predicted using polysolver47 (v.1.0). The mutation detection script of polysolver (shell_call_hla_mutations_from_type) was run on matched tumor–normal pairs to call tumor-specific alterations in HLA-aligned sequencing reads using MuTect (v.1.16). In addition, Strelka248 (v.2.9.10) was run independently to detect short insertions and deletions in HLA-aligned with increased sensitivity. Both single nucleotide mutations and frameshift alterations passing quality control were annotated by shell_annotate_hla_mutations. Based on this annotation, a mutation in the HLA locus was called if a mutation passed all quality filters and introduced either a missense/nonsense change or was located at a splice site. In addition, we also identified second-tier mutations in unfiltered MuTect files that were detected in insufficient reads to pass quality control, but the exact same nucleotide change was clearly detected in another biopsy of the same tumor, allowing detection in samples with low purity or sequencing coverage.
LOH in HLA
LOH at the HLA locus was predicted using LOHHLA33 and sequenza49. First, we evaluated the allele-specific copy number as predicted by sequenza (v.2.1.2) at the HLA-A, HLA-B and HLA-C loci. Samples with a predicted minor allele copy number of 0 (for example, 2:0, 3:0) were labeled as candidate LOH. Then, we ran LOHHLA with the polysolver-generated haplotype files and matched tumor–normal pairs as input. A type I allele of a patient was annotated as ‘allelic imbalance’ (AI) if the associated P value was lower than 0.01. Alleles with AI were labeled as LOH if the following criteria held: (1) the predicted copy number of the lost allele was below 0.5 with confidence interval (CI) strictly below 0.7; (2) the copy number of the kept allele was above 0.75; (3) the number of mismatched sites between alleles was above 10.
HLA LOH was identified by both methods independently in 18 deep-sequenced samples. HLA LOHs called by only one of the methods were inspected manually. Of LOHHLA-exclusive calls, two were found to be false positives and four were identified as AIs with a minor allele copy number of 1. Of sequenza-exclusive calls, one region showed clear LOH and got classified as HLA LOH; six samples showed similar CN pattern but were missed by LOHHLA due to low purity—as adjacent low-pass sequenced samples showed CN = 1 in the HLA region, we also classified these as HLA LOH. In addition, 25 regions had allelic imbalance detected by LOHHLA that were also confirmed in sequenza to have a CN state of 2:1 or 3:1.
Mutations in APGs
We assembled a list of genes involved in class I type MHC presentation, by using the KEGG pathway ‘antigen processing and presentation’ and MHC I pathway specifically. The following genes were considered: TAP1, TAP2, IRF1, NLRC5, TBK1, PSME3, PSME1, ERAP2, ERAP1, HSPBP1, CALR, B2M, PSME2, PSMA7, CANX, CIITA, TAPBP, CREB1, HLA-A, HLA-B, HLA-C, HSP90AA1, HSP90AB1, HSPA2, HSPA4, HSPA5, HSPA6, HSPA8, IFNG, NFYA, NFYB, NFYC, RFX5, RFXANK and RFXAP. Then, we evaluated the expression of each gene within our cohort and filtered out genes that were not clearly expressed (≥10 TPM) in at least 5% of samples.
Then, we called evaluated the mutations called in these genes following. Only mutations with at least moderate predicted impact were called.
Neoantigen prediction and proportional burden computation
We predicted neoantigens from somatic mutation calls and patient-specific HLA haplotypes using NeoPredPipe26 for both FF–WGS and FFPE–PS samples. We defined neoantigen burden in a sample as the number of (unique) mutations giving rise to at least one strong-binding (rank <0.5) neoantigen. We focused our analysis on SNV neoantigen burden, unless stated otherwise. We also computed the total protein-changing mutation burden, and used this value to obtain the proportional neoantigen burden for each sample, that is, what percentage of mutations that has the potential to create a neoantigen does actually lead to strong-binder neoantigens.
Defining clonal/subclonal neoantigens
We assigned clonal/subclonal categories to all mutations (independent of neoantigen status) based on their presence/absence in all available deep- or panel-sequenced sample of a given cancer. As the targeted genome region, sequencing strategy and sample types were different, we created separate mutations lists of FF–WGS and FFPE–PS samples. For FF–WGS samples, mutations present in all sequenced cancer biopsies were denoted as clonal and all other mutations (absent in at least one biopsy) as subclonal. For FFPE–PS samples, mutations present in all biopsies were deemed clonal, and mutations absent in at least two biopsies were denoted as subclonal.
Immune dNdS
Immune dNdS was computed using SOPRANO25, with somatic mutation files and personalized immunopeptidome files derived specifically to HLA haplotypes. First, the ratio between dNdS inside (ON-target dNdS) and outside the immunopeptidome (OFF-target dNdS) was computed and corrected for 192-trinucleotide context. Then immune dNdS was computed as the ratio of ON-to-OFF dNdS to correct for technical artefacts that could bias dNdS as computed in OFF-target regions. Samples without any ON- or OFF-target synonymous mutations were excluded from the analysis. In total, immune dNdS estimate was available for 61 FF–WGS and 41 FFPE–PS samples.
To compute immune dNdS separately, we first filtered somatic mutation files to only contain mutations annotated as clonal, then repeated the above procedure on these files.
SCAA and SCAA loss analysis
We identified SCAAs by comparing purity-corrected and copy number-corrected ATAC-seq peak calls of cancer regions (per cancer) to a pool of normal glands15. For immune escape genes, we filtered SCAAs for those located in promoter or enhancer regions associated with a gene from the list detailed in ‘Mutations in antigen-presenting associated genes.’ Each SCAA was labeled as loss (fold change of cancer compared to normal < −1) or gain (fold change of cancer compared to normal > 1).
To evaluate the observed number of SCAA losses, we repeated the analysis 200 times with a set of 25–40 randomly chosen genes and computed the number of SCAA losses and gains. We derived a P value as the number of random samples we observed a value more extreme than for APGs.
For neoantigen SCAA loss analysis, we filtered SCAA calls to obtain only losses located in promoter regions (within 1,000 bp of the transcription start site of a gene). Each SCAA loss was annotated by the gene to which they were proximal. As most SCA alterations were found to be clonal, we defined SCAA losses on the cancer level.
We then evaluated for each protein-alteration mutation within a cancer whether the gene it is in had an associated SCAA loss or not. For each cancer, we computed the proportion of mutations (both for neoantigens and for nonantigenic mutations) that were classified as downstream of a SCAA loss. Similarly, for each SCAA loss in a given cancer, we evaluated whether it was upstream of a protein-changing mutation and, if so, within each cancer, we counted the number of SCAA losses upstream of neoantigen/nonantigenic mutations.
TF analysis
We used the TF binding sites obtained from curated human TF motifs in ref. 21, also available as part of the processed Mendeley dataset (‘Data availability’).
We examined the list of ATAC-seq peaks that showed statistically significant somatic loss in at least one cancer (27 peaks, representing 20 unique APGs). We then filtered this set to peaks with loss in more than one patient (recurrent SCAA losses), leading to a total of ten genomic regions associated with eight APGs. We identified TFs that are predicted to bind to these regions as those with a binding site in <1,000 bp distance of a given peak, using the distanceToNearest function of the GenomicRanges package in R. We selected for plotting ten TFs that bound more than two of the regions (Extended Data Fig. 2a).
Transcriptional editing analysis
For each SNV detected in FF–WGS samples, we quantified the number of reads in the matched RNA-seq data supporting the mutation using bam-readcount (v.1.0.1)50. For a mutation to be classified as expressed in a sample, we required three or more RNA reads overlapping the position to support the variant base. For a mutation to be classified as not expressed (that is transcriptionally edited), we required more than ten overlapping reads, with zero supporting the variant. We chose the threshold of more than ten so that the probability of misclassifying a mutation present at a true allele frequency of 0.25 or higher was <5%. Mutation–sample pairs that did not qualify for either of these categories were left blank to signify insufficient evidence.
For each cancer, we computed the proportion of mutations that had evidence of transcriptional editing (present in WGS but not expressed in RNA-seq) in at least one sample. Similarly, for each mutation in a cancer, we computed the number of samples that expressed that mutation and the number of samples that (confidently) did not express it. We excluded mutations that had sufficient evidence for expressed/not status in fewer than two samples. We used the median proportion of biopsies with evidence of editing to derive a single value per cancer per mutation type.
Analysis of H&E slides
Sections were cut from diagnostic FFPE blocks sampled at resection. H&E slides pre- and post-laser capture microdissection (LCM) and post-CyCIF were scanned using the NanoZoomer S210 slide scanner (Hamamatsu). Representative sections were selected for annotations, which were drawn on pre-LCM H&E slides as first choice in most cases. Annotations included: all tumor (on a slide); the tumor–normal interface; any deposits of cancer within nodes; and adjacent normal mucosa (within 5 mm of superficial tumor) and normal mucosa further away (>5 mm away from tumor).
Annotations were made using the NDPViewer software (v.2.9.29). The tumor–normal interface was expanded by 500 μm on either side to identify the invasive margin. Images with annotations were analyzed using a digital cell classifier, which uses deep learning methodology modeled on a spatially constrained neural network architecture51. The model first detects cells through predicted location of cell nuclei, then classifies them as: normal epithelial, cancer epithelial, fibroblast, lymphocyte, neutrophil, macrophage, endothelial. Absolute counts are calculated for each annotated region. For each annotation, number of cells were normalized to the total number of epithelial cells to obtain per epithelial cell counts.
In addition, we used a further classifier37 to class all tumor-associated lymphocytes as infiltrating, adjacent or distant based on proximity to tumor cells.
CyCIF imaging
ROIs that matched the cancer glands/microbiopsies used for LCM were identified using the H&Es pre-LCM and post-LCM and extended by an additional 1 mm at the edges. All CyCIF sections were 5 μm thick. The maximum separation between the LCM slides and the CyCIF slide was 10 μm.
The protocol was based on previous methods from ref. 24. Full details of fluorescent antibodies are listed in Supplementary Table 2. Image and statistical analysis of CyCIF images is detailed in Supplementary Methods.
Statistical analysis
All statistical analyses were carried out in R (v.4.4.2). All comparisons between pairs of samples were carried out using two-sided unpaired Wilcoxon rank sum test, without additional adjustment of P values, unless stated otherwise. In all figures, visual elements of the boxplots correspond to the following summary statistics: center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range.
Mixed-effect model with patient effect
To account for multiple samples originating from the same patient, we used a mixed-effects model implemented with the R package lme4 (v.1.1.36). We incorporated Patient as a random effect and the tested variable (for example CRA/CRC) as a fixed effect affecting the intercept into the mixed-effects model. The significance of the fixed effect was evaluated by comparing this ‘full’ model to one with only random effects, using ANOVA. The P value of this test (Pr(>Chisq)) is reported as P(fixed effect).
Evaluating neoantigens in expressed genes and transcriptionally/epigenetically silenced neoantigens
We created 2 × 2 contingency tables by counting mutations that are neoantigens/nonantigenic and (1) in a gene in the given gene group or not; (2) in a consistently expressed gene or not; (3) in a gene with somatically closed promoter or not or (4) not expressed (missing) in at least one sample or not. Fisher’s exact test was used to compute the OR and CIs on these tables. The above steps were repeated separately on clonal and subclonal mutations alone.
Multivariable regression
Multivariable regression models were constructed using the functions betareg (for proportional burden) and lm (for immune dNdS). Sample type (gland/bulk for FF–WGS and superficial tumor/IM/node for FFPE–PS), tissue type (adenoma/cancer), immune escape (escape/weak-escape/no-escape) and patient were encoded as categorical variables; purity as a continuous variable. For Extended Data Fig. 10, sample type and patient were encoded as categorical variables and all immune infiltrate values handled as continuous variables (unit: number per epithelial cell). Results were visualized using the plot_summs function from the package jtools, omitting the full list of coefficients assigned to patients (as well as sample type for Extended Data Fig. 10a) to make the visuals easier to interpret. Instead, on Fig. 5a,b, the smallest and largest patient-associated coefficients were identified and plotted to highlight the range of coefficients.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Source link