A haplotype-based evolutionary history of barley domestication

Sample selection for genome sequencing

Wild barley

Our wild barley panel (Supplementary Table 1) comprised 285 accessions from the Wild Barley Diversity Collection (WBDC)41,42, a collection of ecogeographically diverse accessions. The whole-genome sequencing (WGS) of the WBDC collection has been described in a companion paper17. A further 95 diverse barley accessions, mainly from the panel of Russell et al.14, were also included. The latter set of samples had been sequenced to approximately 3× coverage by Jayakodi et al.37. In the present study, we resequenced 32 of these samples to increase their coverage to approximately 10×.

Domesticated barley

Milner et al.24 defined 12 populations using model-based ancestry estimation with ADMIXTURE43 in a global diversity panel of 19,778 domesticated barley, which had been subjected to GBS24. We used the ADMIXTURE results and GBS SNP matrix of Milner et al.24 for sample selection. Except for the Near-eastern population (coloured orange in figure 1b of Milner et al.24), we selected samples according to the following procedure. First, unadmixed samples, that is, those with an ADMIXTURE ancestry coefficient q ≥ 0.95 were used as input for a PCA with smartpca44 (v7.2.1). Then, samples were selected to cover the PCA space evenly (Supplementary Fig. 5). Owing to its higher genetic diversity and internal substructure, a more sophisticated procedure was followed for the Near-eastern population (Supplementary Table 7 and Supplementary Fig. 6). First, ADMIXTURE43 (v1.23) was run on 1,078 samples of Milner et al.24, where the Near-eastern ancestry coefficient q was higher than that of all other populations, with q ranging from 0.25 to 0.98. Before running ADMIXTURE, the SNP set was thinned with PLINK45 (v1.9) using the parameters ‘–indep-pairwise 50 10 0.1’. For each value of K (the number of ancestral populations) from 2 to 6, the output of 15 replicate runs of ADMIXTURE with different random seeds was combined with CLUMPP46 (v1.1.2) and plotted with Distruct47 (v1.1). Individuals with q ≥ 80% for their main ancestry component were considered unadmixed. The results for K = 6 was chosen for further analysis. The genetic separation of the defined populations was confirmed with smartpca44 (v7.2.1). Only those samples of the Near-eastern subpopulations that were actually located in the Near East were selected for sequencing. The selected samples were distributed in an equidistant manner in the PCA diversity space. In total, we selected 302 samples from 15 populations (Supplementary Table 8 and Supplementary Fig. 7). The populations were named according to their geographical origins and three key traits closely connected to global population structure24 (Supplementary Table 6): row type (two-rowed (T), six-rowed (S) and mixed (M)); lemma adherence (hulled (H) and naked (N)); and annual growth habit (winter sown (W), spring sown (S) and mixed (M)). For example, ISR-THS refers to a population whose members are predominantly two-rowed hulled spring barleys from Israel. For each population, we selected about 20 accessions for WGS sequencing. Among these, 7–10 samples of each population (total: 116, ‘high-coverage samples’) were sequenced to approximately tenfold coverage. The remaining samples of each population were sequenced to approximately threefold coverage (total: 186, ‘low-coverage samples’). Seeds for all selected accessions can be ordered from the German Federal ex situ genebank at IPK Gatersleben.

Plant growth, DNA isolation and Illumina sequencing

Plant cultivation and DNA isolation were essentially as previously described24. Illumina Nextera DNA Flex WGS libraries were prepared and sequenced (paired end: 2 × 151 cycles) on an Illumina NovaSeq 6000 device at IPK Gatersleben according to the manufacturer’s instructions (Illumina).

Reads mapping and variant calling

The reads of 682 barley genotypes, of which 380 were wild and 302 domesticated, were mapped to the MorexV3 genome sequence assembly15 using Minimap2 (v2.24)48. Mapping statistics of all 682 accessions are shown in Supplementary Table 1. BAM files were sorted and deduplicated with Novosort (v3.06.05; https://www.novocraft.com/products/novosort/). Variant calling was done with bcftools (v1.15.1)49 using the command ‘mpileup -a DP,AD -q 20 -Q 20 –ns 3332’. The resultant ‘raw’ SNP matrix was filtered as follows: (1) only biallelic SNP sites were kept; and (2) genotype calls were deemed successful if their read depth ≥ 2 and read depth ≤ 50; otherwise genotypes were set to missing. SNP sites with fewer than 20% missing calls, and fewer than 20% heterozygous calls were used for ADMIXTURE runs (with K ranging from 2 to 4) as described above. At K = 4, wild individuals with 15% or more ancestry from domesticated barley were considered admixed. A total of 80 wild admixed samples were excluded from subsequent analyses (Supplementary Table 2 and Supplementary Fig. 21). A total of 251 wild barley samples with high coverage (approximately 10×) without domesticated admixture were used for subsequent population genetic analyses (Supplementary Table 3).

We prepared two SNP matrices, SNP1 and SNP2, for downstream analysis. For SNP1, we extracted the data for 367 (251 wild and 116 domesticated) high-coverage samples from the raw SNP matrix. SNP1 was filtered as follows: (1) only biallelic SNP sites were kept; (2) homozygous calls were deemed successful if their read depth ≥2 and read depth ≤ 50 and set to missing otherwise; and (3) heterozygous calls were deemed successful if the allelic depth of both alleles was 5 or more and set to missing otherwise. The SNP2 matrix contained variants for 302 domesticated samples and was constructed from another bcftools run using the same parameters as above but with a downsampled dataset, in which the read alignments of the high-coverage samples (n = 116) had been thinned so as to achieve a sequence depth comparable with that of the low-coverage samples (Supplementary Fig. 22) using SAMtools (v1.16.1)49 with the command ‘samtools view -s 0.FRAC’ (FRAC is the sampling rate). The targeted number of uniquely mapped (Q20), deduplicated mapped reads for the downsampled high-coverage data was set to a random number between 35 million and 52 million. Note that the read length was 2 × 150 bp in all samples. The matrix SNP2 was filtered as follows: (1) only biallelic SNP sites were kept; (2) homozygous calls were considered successful if their read depth ≥ 2 and read depth ≤ 20 and set to missing otherwise; and (3) all heterozygous calls were set to missing. A flow chart describing the construction of the SNP matrices used in this study is shown in Supplementary Fig. 23. In analyses in which the use of an outgroup was required, we used WGS data of Hordeum pubiflorum50. Read mapping and SNP calling were done as described above with one difference: a VCF file for all sites in the genome, including those identical to the reference genome, was obtained. This VCF file was merged with other VCF files to determine ancestral states. We also prepared a SNP matrix with 367 (251 wild and 116 domesticated) high-coverage samples using B1K-04-02 (FT11) as the genome reference16 for candidate gene search and SNP age calculation. The reads mapping, SNP calling and filtering procedures were the same as those used for generating the SNP1 matrix.

SNP-based genetic distances

The number of SNPs between any two high-coverage genotypes were calculated as follows. First, pairwise SNP numbers were determined in genomic windows with PLINK2 (v2.00a3.3LM)51 with the command ‘plink2 –from-bp x –to-bp y –sample-diff counts-only counts-cols=ibs0,ibs1 ids=s1 s2 …’, where x and y are the start and end coordinates of a window and ‘s1 s2 …’ is a list of sample IDs. Different window sizes were used: 100 kb (shift of 20 kb), 500 kb (shift of 100 kb), 1 Mb (shift of 200 kb), 2 Mb (shift of 400 kb) and 5 Mb (shift of 5 Mb). Then, in each window, a normalized distance measure was calculated to account for the fact that owing to differences in the mappability of short reads, the effective coverage differs between genomic windows40 (Supplementary Fig. 24). Per-bp read depth was determined for each sample and each position of the reference genome with the command ‘samtools view -q 20 -F 3332 | samtools depth’. The effectively covered region of each window was defined as the union of sites with read depths between 2 and 50. For each, pairwise comparison between samples, the effectively covered regions were intersected using a Perl script. The pairwise distance in a genomic window was calculated as (hom + het/2)/cov, where hom and het are numbers of homozygous and heterozygous differences, respectively, and cov is the size of the intersection of the effectively covered regions of both samples. Genomic windows were considered only if the latter quantity amounted to half the size of the window; otherwise the distance was set to missing.

Validation of SNP number estimation using accurate long reads

To evaluate the accuracy of our SNP number estimates, we used data from the second version of the barley pangenome16. Among the 76 accessions included in the barley pangenome, 13 overlapped with our sample set (Supplementary Table 20). We downloaded the HiFi reads of these 13 accessions and aligned them to the MorexV3 reference genome15 using pbmm2 (v1.10.0; https://github.com/PacificBiosciences/pbmm2). For HiFi reads, the effectively covered region was defined in the same manner as above but with read depths between 10 and 50 considering average HiFi sequencing coverage of approximately 25×. Variant calling was performed with DeepVariant (v1.6.0)52 to generate GVCF files for each sample, followed by joint genotyping using GLnexus (v1.3.1)53,54 to obtain a SNP matrix across the 13 samples. We applied the following filtering criteria: (1) only biallelic SNPs were retained; (2) only genotype calls with depth between 10× and 50× were kept; otherwise, the genotype was set as missing; and (3) for heterozygous calls, we required a minimum allele depth of 10 for each allele. We compared the effective covered region (uniquely mapped regions) of short-read and HiFi-read data across these 13 samples, as well as the intersection of effective covered regions between each pair of samples (Supplementary Fig. 25). The missing rate was calculated for each sample as the number of missing genotype calls divided by its effectively covered region. We then calculated pairwise SNP number between samples using the same method as described above, with a window size of 1 Mb (shift of 200 kb). Only 1-Mb windows in which the intersection of effective covered regions between the two samples exceeds 0.5 Mb were retained for SNP number calculation. Given that SNP number distributions along chromosomes are not always normally distributed — and may even be bimodal in certain cases — we applied Kendall rank correlation to evaluate the consistency between SNP numbers calculated from short reads and HiFi reads (Supplementary Figs. 26 and 27). Confidence intervals for Kendall’s tau correlation coefficients were calculated using a percentile bootstrap method with 1,000 resamples.

Linkage disequilibrium decay

The barley genome was split into three compartments (distal, interstitial and proximal) based on recombination rates19 (Supplementary Table 21 and Supplementary Fig. 28). Linkage disequilibrium decay was calculated for both wild and domesticated barley in each compartment using PopLDdecay (v3.42)55 with the command ‘-Het 0.99 -Miss 0.2 -MAF 0.01 -MaxDist 500’.

Population structure and divergence times in wild barley

Variants calls of 251 high-coverage wild barley samples were extracted from the matrix SNP1 (see above). SNP sites with fewer than 20% missing calls, fewer than 20% heterozygous calls and minor allele frequency (MAF) ≥ 5% were used in population structure analysis. Model-based ancestry estimation was done with ADMIXTURE (v1.23)43. The number of ancestral populations K ranged from 2 to 5. At K = 5, individuals with more than 85% of its main ancestry were considered as unadmixed wild barleys. PCA was done with smartpca (v7.2.1)44. Genotype calls of the outgroup sample H. pubiflorum were merged with the SNP matrix, and an IBS-based genetic distance matrix was calculated with PLINK (v1.9)45. The distance matrix was used to construct a neighbour-joining tree with Fneighbor (https://emboss.sourceforge.net/apps/cvs/embassy/phylipnew/fneighbor.html), which is part of the EMBOSS package56. The resultant tree was visualized with Interactive Tree Of Life (iTOL; v7)57. In each of the five wild barley subpopulations, the nucleotide diversity58 (π) and Watterson’s estimator59 (θW) were calculated from the SNP matrix without MAF filtering using a published Perl script40. Pairwise fixation indices (FST) between pairs of wild barley populations were calculated in genomic windows (size of 1 Mb, shift of 200 kb) using Hudson’s estimator with the formula given as equation 10 (ref. 60) using a published Perl script40. Coverage-normalized SNP distances were calculated as described above in 1-Mb genomic windows (shift of 200 kb). Distributions of log10-transformed distances in the genomic compartments distal, interstitial and proximal were plotted for each wild barley population in R (v3.5.1)61. To infer divergence times, only SNPs in a 50-Mb region flanking the centromeres (±25 Mb) were used. SNP distances were converted into divergence times using the formula g = d/2μ, where g is the number of generations, μ is the mutation rate and d is the number of SNPs per bp. We assumed that the generation time in the annual species H. vulgare is 1 year. We used a random mutation rate of 6.13 × 10−9 as had been determined by Wang et al.62 in the Pooideae grass Brachypodium distachyon. The SNP number distribution was visualized by frequency polygons with logarithmic binning (number of bins of 50, range of 101–104.5 (31,622 SNPs)).

Demographic history of wild barley

Demographic inference was done with PSMC8 (v0.6.5-r67, default parameters) using pseudo-diploid genomes, which were created by combining the BAM files of two homozygous individuals as previously described63,64,65. We performed two types of PSMC analyses. The first was conducted separately for five wild barley subpopulations to infer their respective demographic histories (Supplementary Fig. 2). The second treated all wild barley samples as a single population to capture the average demographic history of the species (Extended Data Fig. 3b). For the population-specific PSMC analysis, we first calculated the IBS distribution for all pairwise combinations of individuals within each group. On the basis of the distribution, IBS values were divided into two to four bins. Within each bin, we selected either all sample pairs (if the number of combinations was fewer than 50) or 50 pairs (if the number of combinations was more than 50) evenly distributed from low to high IBS values (Supplementary Table 22). In selecting sample pairs, we also considered the sequencing coverage of each individual. A pair was retained only if the ratio of coverage, defined as ratio = coveragesample2/(coveragesample1 + coveragesample2), fell within the range 0.45–0.55. For the species-level PSMC analysis, the method was the same, except that each pair of samples was required to come from different subpopulations (Supplementary Table 23). PSMC is based on a panmictic model, assuming random mating, in which an individual (for example, a mammal) carries haplotypes derived from different ancestors. For selfing species, the outcome of pseudo-diploid PSMC is highly dependent on IBS. The higher the IBS, the closer the relationship between the pair, and the more likely the haplotypes come from a shared ancestor, which violates the assumption of random mating in PSMC. Conversely, pairs with lower IBS values are more likely to carry haplotypes from different ancestors, making them more consistent with the PSMC model. Therefore, we used the sample pairs from the lowest IBS bin (0.60 < IBS < 0.67) to represent the average demographic history of wild barley (Fig. 2b and Extended Data Fig. 3b).

Analysis of deep divergence region on chromosome 5H

We used MUMmer (v4.0.0)66 to align eight barley genome assemblies with different haplotypes16 on chromosome 5H, 100–300 Mb. The minimum alignment identity was 90 and the minimum alignment length was 2,000 bp.

We used cross-population composite likelihood ratio (XP-CLR)22, a method for detecting selective sweeps based on allele frequency differentiation, to assess whether a selective sweep signal exists in the deep divergence region on chromosome 5H. First, we performed genotype imputation and phasing of the SNP matrix using Beagle (v5.5)67. We then applied a Python implementation of XP-CLR (https://github.com/hardingnj/xpclr) to calculate XP-CLR scores between the southern Levant population and each of the other four wild barley groups. The analysis was performed using sliding windows of 1 Mb in size (shift of 200 kb). According to our previous definition (Extended Data Fig. 4b), excluding the three intermediate haplotypes, the remaining wild and domesticated barley samples were classified into two haplotype types: haplotype1 and haplotype2. Candidate genes were identified based on the SNP matrix constructed using the wild barley accession B1K-04-02 (FT11) as the reference genome16 (Supplementary Fig. 4b). The effects of SNPs and indels residing in the genes of those regions were classified with SnpEff (v4.3t)68, and variants with high allele frequency differentiation in haplotype1 and haplotype2 were prioritized (Supplementary Table 5).

Definition of ancestral haplotype groups

AHGs were defined with IntroBlocker (v2)9. To determine an appropriate threshold for separating haplotypes, we computed coverage-normalized SNP-based distances in 1-Mb windows (shift of 200 kb): (1) among wild samples; (2) among domesticated samples; and (3) between wild and domesticated samples. In each of the three cases, all possible pairwise combinations of samples were considered. We selected a threshold of 400 SNPs per Mb to separate AHGs. Coverage normalized SNP–distance matrices computed from 367 high-coverage samples were used as input for IntroBlocker with the ‘semi-supervised’ model, giving precedence to wild over domesticated samples in the labelling of AHGs. IntroBlocker was run with different window sizes: 100 kb (shift of 20 kb), 500 kb (shift of 100 kb), 1 Mb (shift of 200 kb), 2 Mb (shift of 400 kb) and 5 Mb (shift of 5 Mb). The results of the 5-Mb run are shown in Fig. 3b and Extended Data Fig. 5b. After inspection of results (Supplementary Fig. 9), the results from the 100-kb (shift of 20 kb) run were used for downstream analyses.

Analysis of the AHG matrix

The proportions of shared and private AHGs in wild and domesticated barleys were determined with custom Perl scripts. Saturation curves were calculated as follows. We chose sets of k wild barleys (from a universe of 251 samples) at random, with k ranging from 1 to 250. For each k, the selection was repeated 100 times. For each of the samples, we determined the proportion of haplotypes seen in the domesticate that were shared with that set. Mean values and 95% confidence intervals for each k were calculated in R (v3.5.1)61based on the t-distribution (via the t.test() function). Two-dimensional haplotype frequency spectra were calculated with custom Perl scripts. Genomic windows with more than 20% missing data points were excluded.

To infer the times at which wild haplotypes entered the domesticated gene pool, we ran IntroBlocker with different thresholds for haplotype separation: 400 SNPs (equivalent to an approximate divergent time of 32,000 years ago), 98 SNPs (8,000 years), 73 (6,000 years), 49 SNPs (4,000 years) and 24 SNPs (2,000 years). For each domesticated haplotype, we compared the results from IntroBlocker runs with different thresholds (divergence time brackets). The latest bracket in which haplotype sharing between wild and domesticated samples occurred was considered a terminus post quem for when a wild haplotype type entered the domesticated gene pool. This method is agnostic about the direction of gene flow. To exclude recent introgressions from domesticated to wild barley, we removed windows in which multiple domesticated barley samples and a few wild barleys share haplotypes that diverged within the past 8,000 years. To determine the spatial origin of haplotypes, we averaged the ancestry ADMIXTURE coefficients of all wild individuals in which a given domesticated haplotype occurred (Supplementary Fig. 29). If two wild samples that shared a domesticated haplotype were highly similar (pairwise IBS ≥ 0.95), only one was used for the calculation.

Haplotype-based genetic diversity and selective sweeps

Saturation curves for the average number of haplotypes in a genomic window as a function of sample size were obtained by randomly selecting k individuals with k ranging from 1 to 115 for domesticated samples and from 1 to 250 for wild samples. For each k, the selection was repeated 100 times. Average haplotype numbers were determined for each subsample. Mean values and 95% confidence intervals were calculated in R (v3.5.1)61 based on the t-distribution (via the t.test() function). θW59 and the Shannon diversity index69 were calculated with a custom Perl script on haplotype matrices including only genomic windows with less than 20% missing data. The θW and Shannon index in seven barley chromosomes were plotted with Gnuplot using ‘smooth bezier’.

We looked for regions of reduced diversity in domesticated relative to wild barley and therein searched for genes that might have been potential targets of selection. To not bias the analysis by the use of a domesticated reference genome (that of cultivar Morex), IntroBlocker was re-run using the SNP matrix based on the wild barley accession B1K-04-02 (FT11)16. Regions with a Shannon index ≤ 1 were considered selective sweeps. The effects of SNPs and indels residing in the genes of those regions were classified with SnpEff (v4.3t)68, and variants with high allele frequency differentiation were prioritized.

The differentiation between populations of domesticated barley was assessed by computing the absolute allele frequency difference70. The following comparisons were done: NE + EU versus ETH, NE + EU versus Asia, ETH versus Asia, NE versus EUT, NE versus EUS, and EUT versus EUS. In addition, we calculated FST in genomic windows (size of 100 kb, shift of 20 kb) using the same method as in wild barley. Allele frequency difference was used for haplotypes derived from high-coverage samples (SNP1); FST calculations were performed for all samples, including low-coverage samples (SNP2).

Demographic history of domesticated barley

Trajectories of effective population size across time were inferred with PSMC8 (v0.6.5-r67, default parameters) using pseudo-diploid genome sequence from two homozygous barley individuals. A generation time of 1 year and a mutation rate of 6.13 × 10−9 were used. We ran PSMC on 341 pseudo-haploid genomes obtained from all possible permutation of sample pairs from within 15 domesticated populations to reflect the population history of each subpopulation of domesticated barley. Given that domesticated barley originates from a mosaic genome composed of diverse wild barley lineages, we used the average demographic history of wild barleys (sample pairs from the lowest IBS bin between 0.60 and 0.67 in Extended Data Fig. 3b) as a reference background to compare temporal changes in effective population size (Ne) between 15 cultivated barley groups and wild barley.

Split times between pairs of domesticated barley populations were determined by inspecting the distributions of SNP numbers between pairs of samples in those windows (size of 1 Mb, shift of 200 kb) where a given pair of samples differed by fewer than 300 SNPs (corresponding to a divergence of 24,470 years). Only 1-Mb windows in which the intersection of effective covered regions between the two samples exceeds 0.9 Mb were retained for SNP number calculation.

The SNP number distribution was visualized by frequency polygons (linear binning; number of bins of 50; range of 0–300). SNP numbers were converted to divergence time using the following formula: time = (SNP number per Mb/106)/(2 × 6.13 × 10−9), where the 6.13 × 10−9 was the random mutation rate (μ) of B. distachyon62.

Validation of inferred split times

We used a previously published two-rowed ancient barley sample, JK301471 (approximately 6,000 years old, from Israel), to assess the accuracy of our method (Supplementary Fig. 30). JK3014 was chosen because it is a high-depth sequenced sample (102×) and underwent uracil–DNA–glycosylase (UDG)72 treatment, which reduces post-mortem DNA damage. JK3014 was jointly analysed with 116 high-depth modern barley samples for SNP calling. SNPs were filtered using the same preprocessing criteria that we applied in our SNP number calculation. We then calculated the SNP number between JK3014 and each of the 116 samples, without excluding C→T and G→A substitutions. The analysis used a 1-Mb sliding window (shift of 200 kb). To convert the SNP number to time, we used two models:

Model 1 assumes JK3014 is a direct ancestor of modern two-row Israel barley (ISR-THS). In this case, time = d/μ, where d equals the SNP number in 1-Mb windows/106 and μ is the mutation rate.

Model 2 assumes JK3014 and ISR-THS share a common ancestor, and their divergence time slightly predates 6,000 years ago. In this case, time = d/(coefficient × μ). If JK3014 was a modern barley sample, the coefficient would be 2. Therefore, a reasonable estimate for this coefficient lies between 1 and 2. We used 1.2 to approximate a divergence time slightly earlier than 6,000 years ago. In addition, as UDG treatment cannot entirely eliminate ancient DNA damage, we assumed 10% of the C→T and G→A SNPs might be false positives. Thus, the final equation for model 2 becomes: time = (d/1.1)/(1.2 × μ).

Estimation of haplotype age for domestication genes

We used GEVA (v1)30 to estimate the age of haplotypes associated with three domestication genes in barley. For GEVA, the alternative allele is assumed to be the derived allele. As the domesticated haplotypes of these genes in domesticated barley are all recessive mutations compared with wild barley, we used the SNP matrix based on the wild barley reference genome B1K-04-12 (FT11)16. This setup ensures that the causal variant of the domesticated haplotype is treated as the derived allele. Phasing of the SNP matrix was performed using Beagle (v5.5)67. For genes with known causal variants, we applied the following strategies to estimate haplotype age: if the causal variant is a SNP (for example, vrs1.a3 and ppd-H1), we directly used GEVA to estimate the age of that SNP. If the causal variant is a short indel (for example, btr1, btr2, vrs1.a1 and vrs1.a2), we constructed pseudo-SNPs at the indel position (for example, for the 1-bp deletion at position 41,130,358 in btr1, C/−), such as C→A, C→T and C→G, and estimated their ages using GEVA.

In both SNP and indel cases, we also identified haplotype-specific private SNPs that are in complete linkage with the causal variant and used these SNPs to estimate haplotype age. The defining feature of a causal variant is that it is private to the focal population and has a genotype frequency of 100%. ‘Private’ refers to those found exclusively in the focal haplotype relative to all other barley samples, including both wild and domesticated barley. The SNPs that we selected as being in ‘complete linkage with the causal variant’ share these same characteristics: they are private to the population and occur at a genotype frequency of 100%. Therefore, these SNPs probably originated either before or concurrently with the causal variant and can be used alongside it to estimate the age of the haplotype. The actual age of the haplotype is thus equal to or later than the age estimated by this method. For each haplotype, we randomly selected approximately 40 private SNPs, as well as the causal SNP or pseudo-causal SNPs for the calculation (Supplementary Table 14). For large deletions (for example, Nud), haplotypes with unknown causal variants (for example, vrs1.a4) and functional (dominant) haplotypes in cultivated barley (Vrs1.b2, Vrs1.b3 and Nud), we estimated haplotype age using approximately 40 private SNPs specific to the domesticated haplotypes. To avoid confounding effects from recombination, we excluded all domesticated samples showing evidence of recombinant haplotypes in the regions of interest (Supplementary Table 13). GEVA analyses were performed using default parameters, and downstream filtering was conducted using the ‘estimate.R’ script provided in the GEVA package. The mutation rate that we used is 6.13 × 10−9 from B. distachyon62. For each SNP, ten replicate runs were performed with different random seeds. Because recombinant haplotypes were excluded from the domesticated haplotype analyses, we reported haplotype ages based on the mutation clock model. Finally, given that barley is a highly selfing species with negligible heterozygosity (that is, nearly haploid in effect), and GEVA was originally developed under a diploid model (for human data), we multiplied all age estimates by 2 to account for ploidy differences and to report the final haplotype age.

As a control group, for each gene locus, we randomly selected approximately 40 SNPs (0.2 < allele frequency < 0.5) from wild barley within the same genomic region and estimated their ages (Supplementary Table 15). Given their uncertain origin — either recent or ancient in the absence of selection — low-frequency SNPs are less suitable as reliable controls. By contrast, high-frequency SNPs (for example, those with frequencies above 20%) are likely to have arisen in the past and become fixed or nearly fixed in the population, and thus are expected to exhibit older ages. For wild barley SNP, the joint mutation and recombination clock model were used. In addition, the recessive ppd-H1 allele, which may predate domestication34, was also included as a control group.

To infer the most likely spatial origins of three genes, a neighbour-joining tree for each gene was constructed with SNPs from an interval within their sweep region. For the btr1/2, vrs1 and nud loci, the interval extended from 39.4 to 39.7 Mb on chromosome 3H, from 570.5 to 517.2 Mb on chromosome 2H and 525.3–525.7 Mb on chromosome 7H, respectively. The neighbour-joining tree was constructed using SNPs based on the MorexV3 reference (SNP1).

Archaeological excavations

We analysed ancient DNA sequences of 23 barley grains excavated at three archaeological sites in Israel (Supplementary Table 16). This number included published data of five barley grains from Yoram Cave71. Archaeobotanical procedures were performed as described by Lev-Marom et al. (manuscript in preparation). The sites Yoram Cave and Timna 34 have been described by Mascher et al.71 and Lev-Marom et al. Abi’or Cave is a medium-sized cave located on the eastern slopes of the Judean Desert, above Jericho, approximately 50 m below sea level, across from the Karantal Monastery. The excavations at the cave were directed by the late H. Eshel in 1986. It is situated above a larger cave known as ‘The Spies Cave’ and has three openings above it. The cave contains a main long tunnel, approximately 50 m long, and has revealed archaeological material dating from the Chalcolithic period to the time of the Bar Kochba Revolt (2nd century ce). The cave was found to be heavily disturbed by animals, antiquities robbers and monks who lived in it during the Islamic and more recent periods.

Ancient DNA sequencing and analysis

All laboratory procedures for sampling, DNA extraction, library preparation and library indexing were conducted in facilities dedicated to ancient DNA work at the University of Tübingen. Before DNA extraction, all seeds were cut into two parts: one part of each seed (36-6.5 mg) was used for DNA extraction and further processing, the other part (26-3.4 mg) was used for radiocarbon dating at the Klaus-Tschira-Archäometrie-Zentrum, Curt-Engelhorn-Zentrum Archäometrie gGmbH. DNA extraction was then performed according to a well-established extraction protocol for ancient plant material71 and double-stranded dual-indexed DNA libraries were produced73,74. Six ancient DNA samples (TU697 and JK2281-JK3014) were treated with UDG72 before sequencing. Sequencing was done on Illumina devices at IPK Gatersleben, the University of Tübingen and the Max-Planck Institute or the Science of Human History Jena.

Paired-end Illumina reads of each sample were merged with leeHom (v1.2.17)75 and mapped to the MorexV3 genome sequence assembly15 using Minimap2 (v2.24)48. BAM files were sorted and duplicates were marked with Novosort (v3.06.05; https://www.novocraft.com/products/novosort/). Nucleotide misincorporation profiles were generated with mapDamage (v2.0.8)76. Variant calling was done with bcftools (v1.15.1)49 using the command ‘mpileup -a DP,AD -q 20 -Q 20 –ns 3332’. We omitted the parameter ‘–variants-only’ in ‘bcftools call’ to output genotype in all sites. C→T and G→A were excluded, where the C and G are the alleles in the reference genomes and T and A are the alternative alleles called from the short-read data. The resultant SNP matrix was merged with the three different SNP matrices: SNP1 (367 high-coverage samples), SNP2 (302 domesticated barleys) and a published SNP matrix constructed from GBS data of 19,778 domesticated barleys24. The GBS matrices had been filtered for site-level missing rate (less than 20%) before merging. The merged SNP1 matrix was used for PCA with smartPCA (v7.2.1)44 using the parameter ‘lsqproject: YES’. Neighbour-joining trees were constructed using only SNPs in a 50-Mb region flanking the centromeres (±25 Mb) on each of the seven chromosomes and including only six high-coverage ancient DNA samples, to determine the proximal haplotypes of ancient barley. The merged GBS matrix was used to compute an IBS matrix with PLINK (v1.9)45. To examine the phylogenetic relationships between ancient DNA and modern domesticated barley, we constructed genome-wide phylogenetic trees using two merged SNP datasets: SNP1 and SNP2, each incorporating ancient DNA samples.

To compare genetic diversity between individual ancient and modern barley samples without relying on population-level statistics, we leveraged rare alleles identified in a comprehensive wild barley panel as proxies for ancestral diversity (Supplementary Fig. 18). Wild barley has the most extensive reservoir of allelic variation; alleles with very low frequency in this panel (for example, 0 < MAF ≤ 0.01) are unlikely to persist through strong bottlenecks or selective sweeps, and thus serve as sensitive markers of lost diversity. For each sample pair, we counted the number of these wild-derived rare alleles present in the ancient genome (A) and in the modern genome (M), and defined the ‘relative diversity change’ as (M – A)/A. A positive value indicates retention or gain of ancestral diversity in the modern sample, whereas a negative value signifies diversity loss relative to the ancient sample. This approach allows us to quantify diversity change at the single-sample level in a straightforwards, interpretable manner, without requiring large cohort sizes or population-based diversity estimators. We calculated the relative change in genetic diversity between six high-coverage ancient samples and modern domesticated barley individuals from 15 populations.

The merged SNP1 and SNP2 were also used for the calculation of D statistics with the qpDstat program of ADMIXTOOLS (v3.0)77. On the basis of previous phylogenetic analyses, we identified ISR-THS as the closest modern barley population to both Yoram Cave and Timna 34, and ME-SHS as the closest to Abi’or Cave. To test for potential gene flow between ancient and modern barley, we performed the following three D statistics analyses: D (ISR-THS, Yoram Cave; P3, H. pubiflorum), D (ISR-THS, Timna 34; P3, H. pubiflorum) and D (ME-SHS, Abi’or Cave; P3, H. pubiflorum). Here P3 refers to any of the 14 modern barley populations other than ISR-THS or ME-SHS, and H. pubiflorum is the outgroup.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


Source link

Leave a Reply

Your email address will not be published. Required fields are marked *