Data-driven de novo design of super-adhesive hydrogels

Hydrogel fabrication

All copolymer gels were synthesized by one-step free-radical copolymerization of monomers with a chemical crosslinker. The crosslinker concentration was fixed at 0.1 mol% relative to the total monomer content to balance the elasticity and deformability of the gels27. DMSO solutions containing functional monomers (total concentration of 2.4 M) with compositions derived from DM and ML (Supplementary Tables 2 and 7), chemical crosslinker (glycerol 1,3-diglycerolate diacrylate, 2.4 mM), and UV initiator (2-oxoglutaric acid, 6 mM) were used. For example, to prepare the G-max gel, 1.819 g of BA, 0.413 g of HEA, 0.264 g of CBEA, 0.561 g of ATAC, 0.441 g of PEA, 8.4 mg of glycerol 1,3-diglycerolate diacrylate and 8.8 mg of 2-oxoglutaric acid were added to a 10 ml volumetric flask, followed by DMSO to reach 10 ml. The precursor solution was transferred to a glove box to remove oxygen, poured into a reaction cell (two 10 cm × 10 cm glass plates, 0.5-mm spacing) and irradiated with UV light (365 nm wavelength, 4 mW cm−2 intensity) for 8 h to form gels (Supplementary Fig. 9a). After UV irradiation, over 99% of the monomers were converted into polymers, as confirmed by NMR (Supplementary Fig. 9b).
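The crosslinker and initiator masses in the G-max example follow directly from the stated concentrations and the 10 ml flask volume. A minimal sketch of that arithmetic is given here; the molar masses are assumed values that are not stated in the text, and the helper name `mass_mg` is hypothetical.

```python
def mass_mg(conc_mol_per_l: float, volume_ml: float, mw_g_per_mol: float) -> float:
    """Mass in mg needed to reach a molar concentration in a given volume."""
    # mol/L * mL = mmol, and mmol * g/mol = mg
    return conc_mol_per_l * volume_ml * mw_g_per_mol


VOLUME_ML = 10.0        # volumetric flask
TOTAL_MONOMER_M = 2.4   # total monomer concentration

# Assumed molar masses (not given in the source text).
MW_CROSSLINKER = 348.35  # glycerol 1,3-diglycerolate diacrylate, g/mol
MW_INITIATOR = 146.10    # 2-oxoglutaric acid, g/mol

# 0.1 mol% of the total monomer content gives 2.4 mM crosslinker.
crosslinker_m = TOTAL_MONOMER_M * 0.1 / 100
crosslinker_mg = mass_mg(crosslinker_m, VOLUME_ML, MW_CROSSLINKER)
initiator_mg = mass_mg(6e-3, VOLUME_ML, MW_INITIATOR)

print(f"crosslinker: {crosslinker_mg:.1f} mg, initiator: {initiator_mg:.1f} mg")
```

Under these assumed molar masses the results round to the 8.4 mg and 8.8 mg quoted in the recipe.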

The as-prepared organogels were then immersed in normal saline (0.154 M NaCl) to remove solvent and residual chemicals, with the saline exchanged every 12 h for at least 2 weeks until swelling equilibrium was reached. Hydrogels were stored in normal saline before use.

Underwater adhesion characterization

The tack test was conducted using a SHIMADZU tester (Autograph AG-X) equipped with Trapezium X software. Hydrogel (0.3–0.8 mm thickness) at swelling equilibrium was adhered to the probe using cyanoacrylate adhesive (super glue). For rapid screening, DM-driven hydrogels from the training round and ML-driven hydrogels from the three optimization rounds were prepared as 15 mm diameter samples. For detailed adhesion studies, 10 mm diameter samples were used to avoid exceeding the force range of the instrument. This change in diameter did not affect the adhesive strength results. The hydrogel on the probe was then immersed in a test solution (for example, normal saline) for 5 min to reach equilibrium. The probe descended towards the substrate at 1 mm min−1 until a loading force of 10 N was applied; this force was maintained for 10 s, and the probe was then withdrawn at 10 mm min−1 (Supplementary Fig. 10). These test conditions were used as a standard protocol unless otherwise specified. For repeated adhesion tests, hydrogels rested underwater for 5 min between cycles, with glass substrates replaced every 100 tests. For prolonged attachment–detachment cycles (Extended Data Fig. 8), a 5 N loading force and a 10 s contact time were used to minimize gel fatigue. Each sample was tested at least three times. For hydrogel dataset construction, the highest adhesive strength recorded for each sample was reported as Fa, representing maximum adhesion performance under the specific conditions.

Lap shear adhesive strength was measured using a universal testing machine (UTM, INSTRON 5965). A hydrogel (10 mm diameter, area A = 78.5 mm2) at swelling equilibrium was sandwiched between two glass slides, pressed at 20 N for 1 min in normal saline. Shear loading was applied at 50 mm min−1. Shear adhesive strength (Fa) was calculated as Fa = Fmax/A, where Fmax is the maximum loading force. For adhesion durability tests (Supplementary Fig. 15), the sandwiched assembly was stored in normal saline for varying durations before testing.
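As a worked illustration of Fa = Fmax/A, the sketch below computes the contact area of a 10 mm disc and converts a force reading into a shear strength. The 50 N force is a made-up example value, and the function names are hypothetical.

```python
import math


def contact_area_mm2(diameter_mm: float) -> float:
    """Area of a circular contact patch in mm^2."""
    return math.pi * (diameter_mm / 2) ** 2


def shear_strength_kpa(f_max_n: float, area_mm2: float) -> float:
    """Fa = Fmax / A. N/mm^2 equals MPa, so multiply by 1000 to get kPa."""
    return f_max_n / area_mm2 * 1000.0


area = contact_area_mm2(10.0)               # about 78.5 mm^2, as stated in the text
example_fa = shear_strength_kpa(50.0, area)  # hypothetical 50 N peak force
print(f"A = {area:.1f} mm^2, Fa = {example_fa:.0f} kPa")
```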

Interfacial toughness was measured by 180° peeling tests using INSTRON 5965. Hydrogel strips (10 mm × 150 mm) were adhered to a glass substrate in normal saline using mild finger pressure, followed by a 2 kg hand roller applied in each direction for 1 min to ensure uniform contact. Polyethylene terephthalate (PET) films (50 μm thickness) served as a stiff backing. Peeling tests were conducted at 50 mm min−1. Interfacial toughness (Gc) was calculated as Gc = 2Fc/w, where Fc is the plateau force and w is the sample width (10 mm).
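The relation Gc = 2Fc/w can be evaluated as in the sketch below. How the plateau force was extracted from the force trace is not described in the text, so averaging over a user-chosen window is an assumption here, and the force values are invented for illustration.

```python
from statistics import mean


def interfacial_toughness(forces_n, width_mm, plateau=slice(None)):
    """Gc = 2 * Fc / w for a 180-degree peel test, returned in J/m^2.

    Fc is taken as the mean force over the `plateau` slice (an assumption).
    """
    f_c = mean(forces_n[plateau])
    return 2.0 * f_c / (width_mm * 1e-3)  # N/m is equivalent to J/m^2


# Hypothetical force trace: ramp-up followed by a 2.5 N plateau.
trace = [0.0, 1.0, 2.0, 2.5, 2.5, 2.5, 2.5]
g_c = interfacial_toughness(trace, width_mm=10.0, plateau=slice(3, None))
print(f"Gc = {g_c:.0f} J/m^2")
```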

DM of adhesive proteins

A comprehensive dataset of adhesive proteins was compiled from the NCBI protein database, using ‘adhesive proteins’ as the query keyword. A total of 24,707 protein sequences from 3,822 different organisms (bacteria, viruses, eukaryotes and animals) were collected without additional data cleaning. Based on taxonomy annotations, proteins were grouped by species, and a consensus sequence was generated for each species to capture common sequence patterns and reduce the influence of individual variations.

The dataset included 3,111 species; because of taxonomic overlap, the per-species protein counts do not sum to 24,707. For robust analysis, the top 200 species, ranked by the number of distinct proteins identified per species, were selected for further study.
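The species ranking step can be sketched as follows. The record layout (species name paired with a sequence string) and the reading of "distinct proteins" as distinct sequences are assumptions, and `top_species` is a hypothetical helper.

```python
from collections import defaultdict


def top_species(records, n=200):
    """Rank species by their number of distinct sequences and keep the top n.

    records: iterable of (species, sequence) pairs.
    """
    by_species = defaultdict(set)
    for species, sequence in records:
        by_species[species].add(sequence)
    # Sort by descending count; break ties alphabetically for reproducibility.
    ranked = sorted(by_species, key=lambda s: (-len(by_species[s]), s))
    return ranked[:n]


toy_records = [
    ("species_A", "MKTL"), ("species_A", "MKTV"), ("species_A", "MKTL"),
    ("species_B", "MRSA"),
    ("species_C", "MGGA"), ("species_C", "MGGS"), ("species_C", "MGGT"),
]
selected = top_species(toy_records, n=2)
print(selected)
```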

Protein sequences were exported in FASTA format45 using the Bio.SeqIO interface in BioPython46. Consensus sequences were computed with Clustal Omega23, which performs multiple sequence alignment by generating a distance matrix from pairwise alignments, constructing a guide tree based on evolutionary relationships and progressively aligning sequences from the closest to the most distant. The resulting alignment identifies the most frequent residues at each position, yielding a consensus sequence that highlights conserved regions.

Clustal Omega was executed with the command:

$$\texttt{./clustalo -i "input\_file" --outfmt=clu -o "output\_aln\_file" -v}$$

where “input_file” and “output_aln_file” denote the input protein sequences and output consensus sequences, respectively. The 200 consensus sequences generated were used for subsequent sequence analysis and hydrogel formulation design.
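To illustrate how a consensus can be read off a multiple sequence alignment, the sketch below takes the most frequent symbol in each aligned column. The exact consensus rule used with the Clustal Omega output is not specified in the text, so the majority rule and the handling of gaps here are assumptions.

```python
from collections import Counter


def majority_consensus(aligned, gap="-"):
    """Most frequent symbol per alignment column; gap-dominated columns are dropped."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    consensus = []
    for column in zip(*aligned):
        # Ties are resolved in favour of the symbol seen first in the column.
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)


# Hypothetical aligned sequences for one species.
alignment = ["MKT-LA", "MKTGLA", "MRT-LV"]
print(majority_consensus(alignment))
```

In this toy alignment the fourth column is mostly gaps, so it contributes nothing to the consensus.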

ML methods

A six-dimensional feature vector, ϕi = [ϕBA, ϕHEA, ϕCBEA, ϕATAC, ϕAAm, ϕPEA], was used to represent monomer proportions in hydrogels. The target variable was adhesive strength, Fa. To model the relationship between ϕi and Fa, we explored both linear and non-linear ML models (Supplementary Tables 5 and 6).

Linear models included least absolute shrinkage and selection operator regression (Lasso) and ridge regression (Ridge). Non-linear models comprised k-nearest neighbours (KNN), kernel ridge regression (KRR), support vector regression (SVR), random forest regression (RFR), gradient boosting regression with XGBoost (XGB), extra trees regression (ETR) and Gaussian process (GP) with a Matérn kernel32,34. These non-linear models encompass non-parametric (KNN), kernel-based (KRR, SVR and GP) and tree-ensemble (RFR, XGB and ETR) approaches, enabling a comprehensive comparison34,35,47.

XGBoost v.1.6.2 was used for XGB, whereas the other models were implemented using Scikit-learn (v.1.0.2) and Scikit-optimize (v.0.9.0). The hyperparameter n_estimators was tuned using Optuna48, whereas the other hyperparameters were optimized using grid search (Supplementary Table 6). A 10-fold cross-validation strategy was used to assess predictive performance on our dataset of 180 hydrogels, using root mean squared error (RMSE) as the metric. GP and RFR, which had the lowest RMSE in training-test error using a 90%/10% train/test split (Extended Data Fig. 4), emerged as the top performer and runner-up, respectively, and were subsequently used as the base (surrogate) models.
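A 10-fold cross-validated RMSE comparison of this kind might look like the following sketch in Scikit-learn. The data are synthetic stand-ins for the 180-hydrogel dataset, and the kernel settings and forest size are illustrative choices, not the tuned values of Supplementary Table 6.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in: 180 compositions of six monomer fractions summing to 1.
X = rng.dirichlet(np.ones(6), size=180)
# Made-up response with noise; not real adhesion data.
y = 100 * X[:, 0] * X[:, 5] + rng.normal(0.0, 0.5, size=180)

models = {
    "GP": GaussianProcessRegressor(
        kernel=Matern(nu=2.5) + WhiteKernel(), normalize_y=True, random_state=0
    ),
    "RFR": RandomForestRegressor(n_estimators=100, random_state=0),
}

cv = KFold(n_splits=10, shuffle=True, random_state=0)
rmse = {}
for name, model in models.items():
    # scikit-learn reports negated RMSE so that larger is better; flip the sign.
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    rmse[name] = -scores.mean()

for name, value in rmse.items():
    print(f"{name}: mean 10-fold RMSE = {value:.3f}")
```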

To make extrapolative predictions, we tried three types of methods.

  1. Exploitation-only enumeration:

    Ten million ϕi vectors were generated from a uniform distribution [0, 1.0) for each monomer, normalized to sum to 1.0. The top five vectors, ranked by predicted Fa from each model, were experimentally validated.

  2. Batched BO:

    • GP_KB: used GP predictions as the hypothetical values for selecting the next data points that maximize expected improvement (EI).

    • GP_CLmax: used the maximum Fa (y_max) from the training set as the hypothetical value for selecting the next data points that maximize EI.

    • GP_CLmin: used the minimum Fa (y_min) from the training set as the hypothetical value for selecting the next data points that maximize EI.

    • GP_LP: incorporated a locally penalized term in EI calculation37.

    GP_KB, GP_CLmax and GP_CLmin simplified the joint q-EI probability calculation36 by assigning a hypothetical value (the GP prediction for GP_KB, a constant for GP_CLmax and GP_CLmin) to each selected point before choosing the next point that maximizes EI. A batch size of q = 10 was selected.

  3. Batched sequential model-based optimization (SMBO):

    • GP-RFR: GP as the hypothetical value provider and RFR as the EI maximizer.

    • RFR-RFR: RFR as both the hypothetical value provider and the EI maximizer.

    • RFR-GP: RFR as the hypothetical value provider and GP as the EI maximizer.

    • RFR-GP*: RFR-GP with a warm start, in which 10 RFR-generated points were added to the real dataset for GP regression.

    • RFR-ETR: RFR as the hypothetical value provider and ETR as the EI maximizer.

    • RFR-GBM: RFR as the hypothetical value provider and GBM as the EI maximizer.

    SMBO iteratively updates the surrogate model while exploring promising data points33. GP and RFR, when used as the hypothetical value providers, balance exploitation and exploration, whereas GP_CLmax and GP_CLmin emphasize exploitation and exploration, respectively49.
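The exploitation-only enumeration (method 1) can be sketched as below: draw each monomer fraction from a uniform distribution on [0, 1), normalize the vector to sum to 1.0, and keep the five candidates with the highest predicted Fa. The predictor `toy_predict` is a hypothetical stand-in for a trained model, and the sample count is reduced from ten million so the example runs quickly.

```python
import heapq
import random

MONOMERS = ("BA", "HEA", "CBEA", "ATAC", "AAm", "PEA")


def sample_composition(rng):
    """Draw one fraction per monomer from U[0, 1) and normalize to sum to 1."""
    while True:
        raw = [rng.random() for _ in MONOMERS]
        total = sum(raw)
        if total > 0:  # guard against the (very unlikely) all-zero draw
            return [r / total for r in raw]


def top_candidates(predict, n_samples, k=5, seed=0):
    """Return the k sampled compositions with the highest predicted value."""
    rng = random.Random(seed)
    candidates = (sample_composition(rng) for _ in range(n_samples))
    return heapq.nlargest(k, candidates, key=predict)


def toy_predict(phi):
    """Hypothetical surrogate: rewards BA and PEA being present together."""
    return phi[0] * phi[5]


best = top_candidates(toy_predict, n_samples=10_000)
for phi in best:
    print([round(p, 3) for p in phi], round(toy_predict(phi), 4))
```

Note that normalizing independent uniform draws does not sample the composition simplex uniformly; it favours mixtures near the centre, so extreme compositions are visited less often.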

SMBO (Supplementary Algorithm 1) consists of four components: the true function (f), global domain (X), acquisition function (S) and surrogate model (M). Initial training data (D) are sampled from X, and experimental Fa values are obtained (line 1). The surrogate model M is fitted to D (line 3) and S (EI) identifies the next data point based on predictive uncertainty (line 4). This data point is subsequently validated experimentally (line 5), updating D (line 6) for T iterations (line 2).
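The loop structure described for Supplementary Algorithm 1 can be sketched generically, with the line numbers from the description marked in comments. All function names are hypothetical, the one-dimensional toy problem replaces the real experiment, and the nearest-neighbour surrogate with its crude acquisition score is only a placeholder for the GP or RFR surrogate with EI.

```python
import random


def smbo(true_f, sample_x, fit, acquisition, n_init=5, n_iter=20, n_pool=200, seed=0):
    """Generic sequential model-based optimization loop (maximization)."""
    rng = random.Random(seed)
    # Line 1: sample initial points from the domain X and evaluate them.
    data = [(x, true_f(x)) for x in (sample_x(rng) for _ in range(n_init))]
    for _ in range(n_iter):                      # line 2: repeat for T iterations
        model = fit(data)                        # line 3: fit surrogate M to D
        y_best = max(y for _, y in data)
        pool = [sample_x(rng) for _ in range(n_pool)]
        # Line 4: pick the candidate that maximizes the acquisition function S.
        x_next = max(pool, key=lambda x: acquisition(model, x, y_best))
        y_next = true_f(x_next)                  # line 5: run the "experiment"
        data.append((x_next, y_next))            # line 6: update D
    return data


# Toy components, all hypothetical stand-ins.
def toy_experiment(x):
    return -(x - 0.3) ** 2


def sample_unit(rng):
    return rng.random()


def fit_nearest(data):
    return list(data)  # the "model" is just the stored observations


def optimistic_gain(model, x, y_best):
    # Predicted gain from the nearest observation plus a distance-based bonus.
    x_nn, y_nn = min(model, key=lambda p: abs(p[0] - x))
    return (y_nn - y_best) + abs(x_nn - x)


history = smbo(toy_experiment, sample_unit, fit_nearest, optimistic_gain)
print(len(history), max(y for _, y in history))
```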

EI quantifies expected improvement, \(\int_{y^{*}}^{\infty}(y-y^{*})\,p(y)\,\mathrm{d}y\), over the current best target (y*). Owing to the time-intensive nature of hydrogel fabrication (each takes about 2 weeks), GP and RFR were used as the hypothetical value providers, enabling the maximization of the joint q-EI probability without requiring new experiments per iteration. EI maximizers (GP, RFR, ETR and GBM) used hyperparameters from Scikit-optimize (v.0.9.0).
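When the predictive distribution p(y) is Gaussian with mean μ and standard deviation σ, the EI integral has the standard closed form (μ − y*)Φ(z) + σφ(z) with z = (μ − y*)/σ. The sketch below assumes a Gaussian p(y), which the text does not state explicitly for every surrogate, and checks the closed form against a direct numerical evaluation of the integral. The input values are arbitrary.

```python
from statistics import NormalDist


def expected_improvement(mu, sigma, y_best):
    """Closed-form EI for a Gaussian prediction N(mu, sigma^2), maximization."""
    if sigma <= 0:
        return max(mu - y_best, 0.0)
    z = (mu - y_best) / sigma
    std = NormalDist()
    return (mu - y_best) * std.cdf(z) + sigma * std.pdf(z)


def expected_improvement_numeric(mu, sigma, y_best, n=20_000):
    """Midpoint-rule evaluation of the integral of (y - y*) p(y) from y* upwards."""
    dist = NormalDist(mu, sigma)
    upper = max(y_best, mu + 10 * sigma)  # the tail beyond this is negligible
    h = (upper - y_best) / n
    if h == 0:
        return 0.0
    total = 0.0
    for i in range(n):
        y = y_best + (i + 0.5) * h
        total += (y - y_best) * dist.pdf(y)
    return total * h


closed = expected_improvement(mu=100.0, sigma=20.0, y_best=110.0)
numeric = expected_improvement_numeric(mu=100.0, sigma=20.0, y_best=110.0)
print(f"closed form: {closed:.4f}, numeric: {numeric:.4f}")
```

EI stays positive even when the predicted mean lies below the current best, because the predictive uncertainty leaves some probability of improvement.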

For GP as the EI maximizer, the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm with bound constraints (L-BFGS-B)50 was executed 20 times per iteration (40 iterations total) to identify the point with the highest EI, updating the GP prior. For the other three EI maximizers (RFR, ETR and GBM), 10,000 points were randomly sampled per iteration, because sampling-based search is more suitable for tree-ensemble models, which lack gradient information. SMBO ran for 40 iterations with each EI maximizer, selecting two sets of 10 data points in each iteration: the top 10 ranked by EI values (batch size q = 10), and the top 10 ranked by predicted Fa values for experimental validation. These two sets may overlap, so the total number of data points may be less than 20.

For BO methods (GP_KB, GP_CLmax, GP_CLmin and GP_LP), the procedure was similar, except that the hypothetical value provider was either GP itself (GP_KB and GP_LP) or constant values (y_max for GP_CLmax and y_min for GP_CLmin).
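The role of the hypothetical value can be illustrated with a small batch-selection sketch: after each EI-maximizing pick, a placeholder target (the GP mean, y_max or y_min) is appended to the data and the GP is refitted before the next pick. This is a simplified stand-in under several assumptions: it uses a NumPy GP with a fixed RBF kernel instead of the Matérn kernel, maximizes EI over a random candidate pool instead of using L-BFGS-B, omits GP_LP, and runs on synthetic data.

```python
import math
import numpy as np


def rbf(a, b, length=0.2):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)


def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    """Posterior mean and standard deviation of a zero-centred GP (unit prior variance)."""
    y_mean = y_train.mean()
    k = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    k_s = rbf(x_train, x_test)
    mu = y_mean + k_s.T @ np.linalg.solve(k, y_train - y_mean)
    var = 1.0 - np.einsum("ij,ij->j", k_s, np.linalg.solve(k, k_s))
    return mu, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mu, sd, best):
    z = (mu - best) / sd
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sd * pdf


def select_batch(x, y, pool, q=10, provider="gp"):
    """Pick q pool points one at a time, filling in hypothetical targets."""
    x_aug, y_aug, chosen = x.copy(), y.copy(), []
    for _ in range(q):
        mu, sd = gp_posterior(x_aug, y_aug, pool)
        scores = expected_improvement(mu, sd, y_aug.max())
        if chosen:
            scores[chosen] = -np.inf  # never pick the same candidate twice
        i = int(np.argmax(scores))
        chosen.append(i)
        # Hypothetical value: GP mean (GP_KB), y_max (GP_CLmax) or y_min (GP_CLmin).
        fake_y = {"gp": mu[i], "max": y.max(), "min": y.min()}[provider]
        x_aug = np.vstack([x_aug, pool[i]])
        y_aug = np.append(y_aug, fake_y)
    return pool[chosen]


rng = np.random.default_rng(0)
x_obs = rng.dirichlet(np.ones(6), size=30)   # synthetic compositions
y_obs = 10.0 * x_obs[:, 0] * x_obs[:, 5]     # made-up response
candidates = rng.dirichlet(np.ones(6), size=500)
batches = {p: select_batch(x_obs, y_obs, candidates, provider=p) for p in ("gp", "max", "min")}
print({p: b.shape for p, b in batches.items()})
```

No new experiment is needed inside the loop, which is the point of the hypothetical values: the whole batch of q = 10 candidates is proposed before any gel is synthesized.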

After the first round, 109 validated points expanded the dataset to 289 hydrogels. The second and third rounds added 27 and 25 points, respectively, resulting in a final dataset comprising 341 hydrogels.

