Designing lipid nanoparticles using a transformer-based neural network

COMET details

This section describes the model architecture and training algorithms of COMET. Pseudocode for inference is provided in Algorithm S1.

COMET model architecture

Lipid molecular structures are encoded into high-dimensional vectors (molecular embeddings), whereas scalar compositional features are encoded using a Gaussian-based encoder⁵³. Continuous formulation-wide parameters (for example, N/P ratio and volumetric mix ratio) are encoded with Gaussian layers; categorical inputs use one-hot embeddings.

The transformer uses a [CLS] token to aggregate input features across multiple attention layers. For multitask learning, each cell type is assigned a separate [CLS] token and prediction head, enabling task-specific outputs while sharing LNP-level representation learning.

Molecular encoder

COMET is compatible with various molecular encoders; here we use Uni-Mol¹¹, pretrained to recover masked atom types and corrupted three-dimensional coordinates. It offers strong property prediction performance and is used with default hyperparameters (from https://github.com/dptech-corp/Uni-Mol/tree/main/unimol). Pretrained weights are frozen during COMET training. Each compound is encoded into a 512-dimensional vector using atom types and coordinates.

Lipid molar percentages are encoded into 128-dimensional vectors using a shared Gaussian layer. Each component is further assigned a 128-dimensional one-hot embedding (\(z_{k}^{\mathrm{type}}\)) to distinguish lipid classes. These are concatenated and projected by a two-layer MLP into a 256-dimensional component representation.

N/P ratio and volumetric ratio

N/P ratio is encoded using a separate 256-dimensional Gaussian layer (z_N/P). Aqueous/organic ratios, treated as categorical variables, are one-hot encoded (z_phase) with 256 dimensions.

CLS token and prediction head

Each cell type uses a learned [CLS] token (z_CLS) of dimension 256. These tokens aggregate component and formulation-wide token representations across N_block transformer layers via attention⁵⁴. Final predictions are made by passing the [CLS] token through a two-layer MLP (MLP_predict).

Transformer blocks

Each block follows a Pre-LayerNorm structure⁵⁵ composed of layernorm → self-attention → MLP with residual connections.

Training details

The model is trained with a binary ranking objective⁵⁶ where, given a pair of LNP samples, the model learns to predict a larger efficacy score for the LNP that has a higher efficacy label value than the other LNP:

$$\mathcal{L}_{\mathrm{ranking}}=-\log\left(\sigma\left(f_{\theta}(x_{\mathrm{h}})-f_{\theta}(x_{\mathrm{l}})\right)\right)$$

(1)

where x_h and x_l are high- and low-efficacy LNPs and f_θ is COMET's scoring function. Training uses a batch size of 64 (2,016 pairwise comparisons per batch).
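The pairwise objective in equation (1) can be sketched in plain Python as follows. This is a minimal illustration and not the PyTorch implementation used for COMET: the function names, the toy inputs and the handling of tied labels (skipped here) are assumptions. The last line shows where the figure of 2,016 comes from, since a batch of 64 samples contains 64 × 63 / 2 unordered pairs.

```python
import math
from itertools import combinations


def ranking_loss(score_high: float, score_low: float) -> float:
    """-log(sigmoid(f(x_h) - f(x_l))), evaluated in a numerically stable form."""
    diff = score_high - score_low
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow in exp
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def batch_ranking_loss(scores, labels):
    """Mean pairwise loss over all pairs in a batch that have distinct labels."""
    losses = []
    for i, j in combinations(range(len(scores)), 2):
        if labels[i] == labels[j]:
            continue  # assumption: tied labels carry no ranking signal
        hi, lo = (i, j) if labels[i] > labels[j] else (j, i)
        losses.append(ranking_loss(scores[hi], scores[lo]))
    return sum(losses) / len(losses) if losses else 0.0


# A batch of 64 samples yields 64 * 63 / 2 unordered pairs.
n_pairs = math.comb(64, 2)
```

The loss is small when the higher-labelled LNP already receives the higher score and grows roughly linearly when the order is reversed.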

Conflict-averse gradient descent

Conflict-averse gradient descent (CAGrad)³³ mitigates conflicting gradients in multitask settings. We apply CAGrad with a coefficient of 0.2 to stabilize training across tasks.

Noise augmentation

To address noise in the experimental data, especially in the fluid handling process, we augment each molar percentage with Gaussian noise proportional to its value, where the standard deviation of the noise is 10% of the actual molar percentage.
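A minimal sketch of this augmentation is given below. The function name, the example composition and the clipping at zero are assumptions made for illustration; the text does not state whether the perturbed percentages are renormalized afterwards, so no renormalization is applied here.

```python
import random


def augment_molar_percentages(molar_pct, rel_std=0.10, rng=None):
    """Add zero-mean Gaussian noise whose std is rel_std times each value."""
    rng = rng or random.Random()
    noisy = []
    for value in molar_pct:
        # the std scales with the value itself: a 50% component gets std 5%
        noisy_value = rng.gauss(value, rel_std * value)
        # assumption: clip at zero so a percentage never becomes negative
        noisy.append(max(noisy_value, 0.0))
    return noisy


rng = random.Random(0)
base = [50.0, 38.5, 10.0, 1.5]  # hypothetical ionizable/sterol/helper/PEG split
augmented = augment_molar_percentages(base, rng=rng)
```

Because the noise scales with the value, minor components such as the PEG lipid are perturbed by the same relative amount as major ones.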

Label margin

From the label values, we can tell not only which LNP is better than another but also by how much. To train the model to learn this additional information, we include a margin term⁵⁷ in the binary ranking objective:

$$\mathcal{L}_{\mathrm{ranking}}=-\log\left(\mathrm{Sigmoid}\left(f_{\theta}(x_{\mathrm{h}})-f_{\theta}(x_{\mathrm{l}})-\lambda_{\mathrm{margin}}(y_{\mathrm{h}}-y_{\mathrm{l}})\right)\right)$$

(2)

where y_h and y_l are the (efficacy) label values of the more efficacious and less efficacious LNP, respectively, and λ_margin controls how much this objective dominates the training. We use λ_margin = 0.01 in our experiments.
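The effect of the margin term in equation (2) can be seen in a small sketch. The function name and the toy scores and labels are hypothetical; only the functional form and the default weight of 0.01 come from the text.

```python
import math


def margin_ranking_loss(score_high, score_low, label_high, label_low,
                        margin_weight=0.01):
    """-log(sigmoid(f(x_h) - f(x_l) - lambda_margin * (y_h - y_l)))."""
    diff = score_high - score_low - margin_weight * (label_high - label_low)
    # stable evaluation of log(1 + exp(-diff))
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Same score gap, different label gaps: the larger label gap is penalized more,
# so the model is pushed to separate such pairs further.
small_gap = margin_ranking_loss(1.0, 0.5, label_high=0.6, label_low=0.5)
large_gap = margin_ranking_loss(1.0, 0.5, label_high=0.9, label_low=0.1)
```

Setting `margin_weight` to zero recovers the margin-free objective of equation (1).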

Ensembling

For in silico evaluation (Fig. 3e–l), the ensemble is formed by N_model models trained with the same hyperparameters and dataset (train/valid/test split) but with weights initialized with different random seeds. For the ensemble deployed to infer virtual LNPs, five different train (80%)/valid (20%) splits are made in a fivefold manner and each model in the ensemble is trained on a different fold. To ensure that ensembled scores are not biased towards models with high variance, the predicted scores from each model are normalized by making their scores for the LANCE LNPs fit a normal distribution with mean 0 and standard deviation 1 before ensembling. More specifically, for each model, this is done by inferring the predicted scores on all the LANCE LNPs and using the mean (mean_i) and standard deviation (std_i) of the LANCE LNPs' scores to compute the normalized scores \(y_{i}^{\prime\,\mathrm{normalized}}\) by

$$y_{i}^{\prime\,\mathrm{normalized}}=\frac{y_{i}^{\prime}-\mathrm{mean}_{i}}{\mathrm{std}_{i}},\quad i\sim\{1,\ldots,N_{\mathrm{model}}\}$$

(3)

The final ensemble score is the mean of all models' normalized scores:

$$y^{\prime\,\mathrm{ensemble}}=\frac{1}{N_{\mathrm{model}}}\sum\limits_{i=1}^{N_{\mathrm{model}}}y_{i}^{\prime\,\mathrm{normalized}}$$

(4)
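Equations (3) and (4) amount to a per-model z-score followed by an average. The sketch below illustrates this with two hypothetical models whose raw scores lie on very different scales; the function names and numbers are illustrative, and the use of the population standard deviation is an assumption, as the text does not specify the estimator.

```python
from statistics import mean, pstdev


def normalize_scores(scores, reference_scores):
    """Z-score `scores` with the mean/std of the model's reference (LANCE) scores."""
    mu = mean(reference_scores)
    sd = pstdev(reference_scores)  # assumption: population standard deviation
    return [(s - mu) / sd for s in scores]


def ensemble_scores(per_model_scores, per_model_reference):
    """Average the normalized scores of each candidate over all models."""
    normalized = [
        normalize_scores(scores, ref)
        for scores, ref in zip(per_model_scores, per_model_reference)
    ]
    n_models = len(normalized)
    # zip(*normalized) regroups the scores candidate by candidate
    return [sum(candidate) / n_models for candidate in zip(*normalized)]


reference = [[0.0, 1.0, 2.0], [0.0, 100.0, 200.0]]  # scores on reference LNPs
candidates = [[1.0, 3.0], [100.0, 300.0]]           # scores on new candidates
combined = ensemble_scores(candidates, reference)
```

After normalization both models contribute equally, even though the second model's raw scores are a hundred times larger.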

COMET is implemented in PyTorch and trained with NVIDIA V100 GPUs.

k-Nearest neighbours and random forest model details

The k-nearest neighbours and random forest models are implemented with the scikit-learn (https://scikit-learn.org/) package, with default hyperparameters. More specifically, the k-nearest neighbours model uses n = 5 nearest neighbours whereas the random forest model uses n = 100 estimators (trees).

LANCE dataset details

LANCE comprises four parts spanning orthogonal LNP design dimensions: lipid component identities, molar percentages, synthesis parameters (for example, N/P and aqueous/organic volumetric ratios) and high-resolution molar sweeps.

Seven ionizable lipids, three sterols, two helper lipids and two PEG lipids were used (Supplementary Table 14), reflecting the focus of current research¹²,⁴². To study molar % effects, we designed 13 lipid ratios by varying one lipid class at a time from a reference BASE ratio (Fig. 1d), based on ref. ²⁰. For instance, ratios I1–I4 modify ionizable lipid %, C1–C3 adjust cholesterol (compensated by helper lipid), and P1–P3 alter PEG lipid %, whereas the remaining ratios modify multiple components (Supplementary Table 13).

Part 1 (lipid choice)

To examine lipid identity effects, we generated 84 combinations from all permutations of 7 ionizable lipids, 3 sterols, 2 helper lipids and 2 PEG lipids. Paired with 13 molar ratios, this results in 1,092 possible LNPs; 1,066 were tested. After removing 91 overlapping with part 2, this part yielded 975 unique LNPs.
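The counts in this part follow from a simple enumeration, sketched below with placeholder names (the actual lipid identities are listed in Supplementary Table 14 and are not reproduced here).

```python
from itertools import product

# Placeholder names only; they stand in for the lipids of Supplementary Table 14.
ionizable = [f"ionizable_{i}" for i in range(1, 8)]   # 7 ionizable lipids
sterols = [f"sterol_{i}" for i in range(1, 4)]        # 3 sterols
helpers = [f"helper_{i}" for i in range(1, 3)]        # 2 helper lipids
pegs = [f"peg_{i}" for i in range(1, 3)]              # 2 PEG lipids
molar_ratios = [f"ratio_{i}" for i in range(1, 14)]   # 13 molar ratios

lipid_sets = list(product(ionizable, sterols, helpers, pegs))
candidates = list(product(lipid_sets, molar_ratios))

n_lipid_sets = len(lipid_sets)   # 7 * 3 * 2 * 2
n_candidates = len(candidates)   # 84 * 13
n_unique = 1066 - 91             # tested LNPs minus those overlapping with part 2
```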

Part 2 (ionizable lipid synergy)

Following studies suggesting synergy from dual-ionizable lipid formulations⁵⁸, we created LNPs with 60:40 molar splits across all ionizable lipid pairs, distributed across 13 lipid ratios. This yielded 637 additional LNPs.

Part 3 (key synthesis parameters)

To explore synthesis effects, we introduced variation in ionizable lipid/RNA weight ratios (10:1, 15:1 and 20:1) and aqueous/organic phase ratios (1:1 and 3:1). Weight ratios were adjusted by molar mass to maintain equal molar %. These parameters were later converted to N/P ratios for model input. This part consists of 924 LNPs.

Part 4 (molar percentage sweeps)

To study finer-grained molar % effects, we created 24 evenly spaced intervals from 10% to 80% for ionizable lipid, cholesterol and helper lipid, producing 492 LNPs across 3 focused sweeps.

Formulation ratios

Single-ionizable LNPs span 18 unique N/P ratios, derived from 3 ionizable lipid/RNA weight ratios and 7 ionizable lipids. Dual-ionizable formulations add 63 more, totalling 81 N/P ratios. In molar terms, 13 base lipid ratios and 72 sweep ratios (24 per lipid class) result in 85 total molar compositions.
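The totals above can be checked with a few lines of arithmetic. The derivation of the 63 dual-ionizable values as "every unordered lipid pair at every weight ratio" is one reading that reproduces the reported count, not something the text states; likewise, 7 × 3 = 21 is only an upper bound for the single-ionizable case, and the reported 18 is taken as given.

```python
from math import comb

n_ionizable_lipids = 7
n_weight_ratios = 3                    # 10:1, 15:1 and 20:1

# Assumed reading: every unordered pair of ionizable lipids at every weight ratio.
n_dual_np = comb(n_ionizable_lipids, 2) * n_weight_ratios

n_single_np = 18                       # reported number of unique single-ionizable values
n_total_np = n_single_np + n_dual_np

n_base_ratios = 13
n_sweep_ratios = 3 * 24                # three lipid classes, 24 sweep points each
n_molar_compositions = n_base_ratios + n_sweep_ratios
```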

LNP synthesis

LNPs were synthesized by mixing lipid–ethanol and mRNA–citrate buffer phases, followed by incubation at 4 °C for 10 min. Automated handling was performed on the Tecan Fluent platform. For animal studies, LNPs were mixed, incubated on ice for 10 min and dialysed overnight at 4 °C in PBS (Slide-A-Lyzer, ThermoFisher).

Materials

FLuc mRNA (L-7202, Trilink); lipids (Cayman Chemical, Avanti); luciferase assay (Steady-Glo, E2550) and Agilent BioTek plate reader for readout. alamarBlue was used for viability assays.

Data processing

Each 96-well plate included a 'standard' LNP. Raw luminescence values were normalized to the standard and averaged across four replicates (two biological, two technical). Mean values were log-transformed and min–max normalized to [0, 1].
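The processing steps can be sketched as below. The function names, the single standard reading per plate and the example luminescence values are assumptions for illustration. The logarithm base is not stated in the text; base 10 is used here, and the min–max step makes the final labels independent of that choice.

```python
import math
from statistics import mean


def process_plate(raw_by_lnp, standard_raw):
    """Normalize raw luminescence to the plate standard, then average replicates."""
    return {
        name: mean(value / standard_raw for value in replicates)
        for name, replicates in raw_by_lnp.items()
    }


def log_min_max(mean_values):
    """Log-transform the mean values and rescale them to [0, 1]."""
    logged = {name: math.log10(value) for name, value in mean_values.items()}
    lo, hi = min(logged.values()), max(logged.values())
    return {name: (value - lo) / (hi - lo) for name, value in logged.items()}


# Hypothetical plate: four replicate readings per LNP and one standard reading.
plate = {
    "LNP_A": [1.0e4, 1.2e4, 0.9e4, 1.1e4],
    "LNP_B": [1.0e6, 1.1e6, 0.9e6, 1.0e6],
    "LNP_C": [1.0e5, 1.0e5, 1.0e5, 1.0e5],
}
labels = log_min_max(process_plate(plate, standard_raw=1.0e5))
```

The weakest formulation maps to 0 and the strongest to 1, with the others spaced according to their log-scale efficacy.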

We have represented several key features of the LANCE dataset in Fig. 2. Below, we explain how these key features were extracted from LANCE. For Fig. 2a, part 1 formulations were selected. For each of the four ionizable lipids (ALC-0315, DLin-MC3-DMA, C12-200 and CKK-E12), we had 156 formulations containing 2 helper lipids, 3 sterol lipids and 2 PEG lipids (that is, 2 × 2 × 3 = 12 combinations) at 13 molar ratios (12 × 13 = 156 formulations).

For Fig. 2b,c, part 3 formulations containing one ionizable lipid, cholesterol and C14-PEG were selected. Two molar ratios of the lipid components (which are shown in the figure) were studied. The ionizable lipid to mRNA molar ratio was 10,162. The aqueous to organic volume ratio was varied. For Fig. 2d, part 3 formulations containing one ionizable lipid, DOPE, cholesterol and C14-PEG were selected. Two molar ratios of the lipid components (which are shown in the figure) were studied. The organic to aqueous volume ratio was held at 1:3.

Figure 2e was generated from part 2 data. Only formulations containing DOPE, cholesterol and C14-PEG were used for the graph. The name of the first ionizable lipid was listed as the title of the graph and the name of the second ionizable lipid was the row name. The total molar content of the ionizable lipids was the column name. The molar ratio of ionizable lipid 1/ionizable lipid 2 was 1.5. The molar ratio of DOPE/cholesterol was 0.34. The molar % of C14-PEG was 2.5%. The molar ratio of ionizable lipid/mRNA was 10,162. The entire library was used to construct Fig. 2f. We calculated the normalized transfection efficacy for the 30th and 70th percentile formulations in B16-F10 and DC2.4 cells. These values were as follows: 70th percentile, B16-F10 = 0.43887; 30th percentile, B16-F10 = 0.24315; 70th percentile, DC2.4 = 0.64623; 30th percentile, DC2.4 = 0.30946. Formulations above and below these values in the respective cell lines were selected and are plotted in Fig. 2f.

In vitro validation details

The LNPs are named according to the groups to which they belong. A summary of the prefixes used here is given in Supplementary Table 16.

Clinically approved LNP baselines

The recipes for the three clinical LNP baselines are based on the literature³⁹, and the LNPs were synthesized at an aqueous/organic volumetric ratio of 3:1, following what is commonly used in previous work.

Top LANCE LNP hits baselines

To find strong and reliable LNP baselines from LANCE, we randomly selected 10 LNP formulations from the 90th percentile for each cell line and screened them again with the respective cell line to check for reproducibility. Among these ten formulations, the three LNPs with normalized efficacy values closest to their original LANCE efficacy label values were selected as LANCE baseline LNPs.

Exploratory LNP library

To span a vast formulation space, the virtual library was generated by enumerating through possible LNP features such as lipid choices, their molar percentages and key synthesis parameters such as N/P ratios and aqueous/organic volumetric ratios, according to Supplementary Table 15. To find LNPs that are different from the hits in the LANCE dataset, formulations within a 10% L1 distance lipid molar percentage neighbourhood of any of the top 10% most efficacious LANCE hits were excluded. After this step, the exploratory library has 27,354,600 and 34,539,960 formulations for DC2.4 and B16-F10, respectively. An ensemble of five COMET models predicted efficacy in both cell lines. The top 0.1% highest-scoring LNPs were selected (34,539 B16-F10 and 27,354 DC2.4).
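The exclusion and selection steps can be sketched as follows. The function names and the four-component example vectors are hypothetical, and two details are assumptions: a candidate at exactly the threshold distance is treated as inside the neighbourhood, and the number of retained candidates is rounded down.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two molar-percentage vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def exclude_near_hits(candidates, hits, radius=10.0):
    """Drop candidates within `radius` (in molar %) of any known hit."""
    return [
        c for c in candidates
        if all(l1_distance(c, h) > radius for h in hits)
    ]


def top_fraction(scored, fraction=0.001):
    """Keep the highest-scoring `fraction` of (candidate, score) pairs."""
    n_keep = int(len(scored) * fraction)  # assumption: round down
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n_keep]


known_hits = [[50.0, 38.5, 10.0, 1.5]]   # hypothetical top LANCE hit
library = [
    [52.0, 36.5, 10.0, 1.5],   # L1 distance 4: too close to the hit, excluded
    [30.0, 48.5, 20.0, 1.5],   # L1 distance 40: kept
]
novel = exclude_near_hits(library, known_hits)
```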

The next step removes formulations based on uncertainty in COMET prediction. We capture the level of uncertainty by first computing the standard deviation (σ) between the models' predictions (\(y_{i}^{\prime\,\mathrm{normalized}}\) in equation (3)) within the ensemble. We then scale the standard deviation by dividing it by a non-negative predicted efficacy term to get a relative uncertainty value (u_rel):

$$u_{\mathrm{rel}}=\frac{\sigma}{\hat{y}^{\mathrm{ensemble}}},\quad \hat{y}^{\mathrm{ensemble}}=y^{\prime\,\mathrm{ensemble}}-y^{\prime\,\mathrm{ensemble,min,LANCE}}$$

(5)

where \(y^{\prime\,\mathrm{ensemble,min,LANCE}}\) is the minimum ensemble score among the LANCE LNPs. Any formulations with a negative \(\hat{y}^{\mathrm{ensemble}}\) term were dropped. Supplementary Fig. 13 shows the distribution of this relative uncertainty value. Formulations with the largest 50% of relative uncertainty values were removed, leaving 17,269 B16-F10 and 13,677 DC2.4 formulations.
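A minimal sketch of this uncertainty filter is shown below. The names and toy scores are hypothetical. The population standard deviation is an assumption, and candidates whose shifted score is exactly zero are dropped together with the negative ones to avoid dividing by zero.

```python
from statistics import mean, pstdev


def relative_uncertainty(normalized_scores, min_lance_score):
    """u_rel for one candidate, or None if the shifted ensemble score is not positive."""
    sigma = pstdev(normalized_scores)                 # spread between ensemble members
    shifted = mean(normalized_scores) - min_lance_score
    if shifted <= 0:
        return None
    return sigma / shifted


def keep_most_certain(candidates, min_lance_score, keep_fraction=0.5):
    """Keep the `keep_fraction` of valid candidates with the smallest u_rel."""
    scored = []
    for name, scores in candidates.items():
        u_rel = relative_uncertainty(scores, min_lance_score)
        if u_rel is not None:
            scored.append((u_rel, name))
    scored.sort()
    return [name for _, name in scored[: int(len(scored) * keep_fraction)]]


ensemble_predictions = {       # per-model normalized scores for each candidate
    "a": [1.0, 1.0, 1.0],
    "b": [0.0, 2.0, 1.0],
    "c": [-5.0, -5.0, -5.0],   # below the LANCE minimum, so it is dropped
    "d": [2.0, 2.2, 1.8],
    "e": [0.5, 1.5, 1.0],
}
kept = keep_most_certain(ensemble_predictions, min_lance_score=-3.0)
```

Dividing by the shifted score means that the same absolute disagreement between models counts for less when the predicted efficacy is high.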

To promote chemical diversity, K-means clustering (on 14-dimensional vectors encoding lipid molar percentages) grouped these candidates into 10 clusters. Clustering was repeated 1,000 times to stabilize assignments. The highest-scoring formulation in each cluster was selected, resulting in ten diverse in silico hits per cell line (Supplementary Tables 17 and 18).

Lead optimization LNP library

For each cell type, three top LANCE hits (from 'Top LANCE LNP hits baselines' section) were used as starting points. Around each, virtual candidates were generated by (1) exploring within a 20% L1 molar percentage distance, (2) substituting at least one lipid (6 ionizable lipids, 2 cholesterols, 1 helper and 1 PEG) and (3) changing the N/P ratio.

To generate three diverse candidates per lead, we segmented the neighbourhood into three zones: (1) molar % segment (within 20% L1, no lipid changes), (2) substitute-lipid segment (within 20% L1, but with at least one different lipid) and (3) N/P ratio segment (differing N/P ratio). From each zone, the top predicted LNP was selected (Fig. 3d, right). This yielded three optimized LNPs per lead. The virtual library size ranged from 1.5 million (single-ionizable lipid) to 9 million (dual-ionizable lipid) candidates. The sixfold increase in dual-ionizable lipid cases arises from combinatorial enumeration: each minor ionizable lipid was paired with six major ones. By contrast, single-ionizable lipid compositions require no pairing. The final selected formulations for validation are listed in Supplementary Tables 19 and 20.
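The zone-based selection can be illustrated with a short sketch. The dictionary layout, lipid labels and scores are hypothetical, the candidates are assumed to have already passed the 20% L1 filter, and the order in which the zone criteria are checked (N/P ratio first) is an assumption made so that the zones do not overlap.

```python
def assign_zone(candidate, lead):
    """Place a candidate into one of three non-overlapping zones around a lead."""
    if candidate["np_ratio"] != lead["np_ratio"]:
        return "np_ratio"
    if candidate["lipids"] != lead["lipids"]:
        return "substitute_lipid"
    return "molar_percent"


def top_per_zone(candidates, lead):
    """Return the highest-scoring candidate of each zone."""
    best = {}
    for candidate in candidates:
        zone = assign_zone(candidate, lead)
        if zone not in best or candidate["score"] > best[zone]["score"]:
            best[zone] = candidate
    return best


lead = {"lipids": ("I1", "S1", "H1", "P1"), "np_ratio": 6.0}
virtual = [
    {"lipids": ("I1", "S1", "H1", "P1"), "np_ratio": 6.0, "score": 0.7},
    {"lipids": ("I1", "S1", "H1", "P1"), "np_ratio": 6.0, "score": 0.9},
    {"lipids": ("I2", "S1", "H1", "P1"), "np_ratio": 6.0, "score": 0.8},
    {"lipids": ("I1", "S1", "H1", "P1"), "np_ratio": 8.0, "score": 0.6},
]
selected = top_per_zone(virtual, lead)
```

Selecting one candidate per zone guarantees that the three optimized LNPs differ from the lead in different ways, even if one zone dominates the overall ranking.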

PBAE synthesis

The compositions and molar ratios of amines, diacrylates and branching agents are listed in Supplementary Table 21. PBAE polymers were synthesized from combinations of the amines, diacrylates and branching agents. Briefly, the total weight of diacrylate and branching agent was added to a 20 ml glass vial. Then, the solvent (dimethylformamide) was added to the reaction mixture. The reaction vials were then placed on a hotplate at 90 °C. After 24 h, the vials were removed from the hotplate and cooled to room temperature. The amines were added to the reaction vials, which were placed back on the hotplate at 90 °C, and the reaction was allowed to proceed for 48 h. Finally, the vials were removed from the hotplate and allowed to cool to room temperature. Then, the reaction mixture was added (drop-by-drop) into a beaker containing ~150 ml ice-cold diethyl ether (~10× excess volume). The collected samples were transferred to 50 ml tubes and centrifuged at 1,000 × g for 3 min to pellet the polymer. The supernatant was then removed and the pellet was dissolved in the minimum possible volume of dimethylformamide. This purification step was repeated three times. Final polymers were dried under vacuum and their solubility was tested in ethanol.

Representing PBAEs in COMET

PBAEs were represented as a combination of their diacrylate–amine repeating unit and branching agent, each with unique component-type embeddings. The repeating unit was treated as a fifth molar component type alongside lipids, with its molar concentration estimated from polymer weight and molecular weight. Total molar percentages of PBAE and lipids sum to 100%. Inference proceeds as in lipid-only LNPs ('COMET details' section).
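One way to obtain such molar percentages is sketched below. The component weights and molar masses are made-up values, and the sketch assumes that the molecular weight used for the polymer is that of the repeating unit, which the text does not state explicitly.

```python
def molar_percentages(weights_mg, molar_masses):
    """Convert component weights to molar percentages that sum to 100."""
    moles = {name: weights_mg[name] / molar_masses[name] for name in weights_mg}
    total = sum(moles.values())
    return {name: 100.0 * n / total for name, n in moles.items()}


# Hypothetical formulation: four lipids plus the PBAE repeating unit.
weights_mg = {
    "ionizable": 10.0,
    "sterol": 5.0,
    "helper": 2.0,
    "peg": 1.0,
    "pbae_repeat_unit": 4.0,
}
molar_masses = {            # g/mol, illustrative values only
    "ionizable": 700.0,
    "sterol": 390.0,
    "helper": 740.0,
    "peg": 2500.0,
    "pbae_repeat_unit": 400.0,
}
composition = molar_percentages(weights_mg, molar_masses)
```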

COMET PBAE LNP lead optimization hits

Two top-performing PBAE LNPs per cell type were used as starting points. Around each, virtual candidates were generated by (1) exploring within a 20% L1 molar percentage neighbourhood, (2) substituting lipids (6 ionizable lipids, 2 sterols, 1 helper and 1 PEG) and (3) converting to dual-ionizable compositions. To select three diverse candidates, we defined three non-overlapping segments: one within the 20% L1 distance with the same lipid choices, one with at least one different lipid compound and one with a dual-ionizable lipid configuration. The top predicted LNP from each segment was selected (Fig. 3d, right). Final hits are detailed in Supplementary Tables 22 and 23.

Human IL-15 screening

The IL-15 mRNA is synthesized via in vitro transcription with a HiScribe T7 mRNA kit with CleanCap Reagent AG (E2080S) from New England Biolabs, with 5-methoxy-UTP (N-1093) from Trilink. The LNP transfection is done at an mRNA concentration of 0.25 µg ml⁻¹ in the 96-well plate format. The human IL-15 expression level is measured with the Human IL-15 Uncoated ELISA (88-7620) procured from Invitrogen, after 16 h of incubation of HepG2 cells with LNPs. Raw efficacy data are normalized, similarly to the bioluminescence data mentioned above, before being used as the dataset for machine learning experiments. A random 20% of this dataset is held out as the test set, whereas the rest is used as the training and validation sets.

Lyophilization of LNPs

The LNPs are synthesized in a tris buffer (5 mM tris buffer, pH 8). After synthesis, the LNP formulations are frozen at −80 °C for 2 h before undergoing the following lyophilization process: equilibrate at −40 °C for 2 h, at atmospheric pressure → −40 °C for 21 h, in vacuum → 25 °C for 2 h, in vacuum. A Labconco FreeZone 6 l with a Stoppering Tray Dryer was used for lyophilization.

Degradation in the efficacy

Post-lyophilization efficacy values were computed and normalized similarly to the LANCE label values ('Data processing' section). The degradation of efficacy owing to lyophilization was calculated by subtracting the post-lyophilization efficacy values from the LANCE B16-F10 values.

Animal experiments

Animal experiments for this study were approved by the Massachusetts Institute of Technology Institutional Animal Care and Use Committee and were consistent with local, state and federal regulations as applicable. Female C57BL/6J mice (000664, The Jackson Laboratory) were used in the experiments. For imaging, d-luciferin (LUCK-1G, Gold Biotechnology) solubilized in PBS was administered via intraperitoneal injection and the mice were imaged using an IVIS imaging system (PerkinElmer).

t-SNE visualization

We selected the COMET model most correlated (Spearman) with ensemble scores across a random virtual LNP subset. LNP features for t-SNE were the final [CLS] token representations. To ensure even distribution across ionizable types, dual-ionizable lipid LNPs were treated as a distinct class, and 1,250 LNPs per class (8 total) were randomly sampled (10,000 total).
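The balanced sampling can be sketched as follows. The class names, pool sizes and seed are placeholders; only the eight classes and the 1,250 samples per class come from the text.

```python
import random


def sample_per_class(lnps_by_class, n_per_class, seed=0):
    """Draw the same number of LNPs, without replacement, from every class."""
    rng = random.Random(seed)
    sampled = []
    for label in sorted(lnps_by_class):
        picks = rng.sample(lnps_by_class[label], n_per_class)
        sampled.extend((label, lnp) for lnp in picks)
    return sampled


# Seven single-ionizable classes plus one dual-ionizable class (placeholder pools).
class_names = [f"ionizable_{i}" for i in range(1, 8)] + ["dual_ionizable"]
pools = {name: [f"{name}_lnp_{k}" for k in range(2000)] for name in class_names}
subset = sample_per_class(pools, n_per_class=1250)
```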

Integrated gradients implementation

To execute integrated gradients (IG) with COMET's multimodal inputs, we adapted the Captum library. IG computes attribution by integrating gradients along a path from reference to input. Feature attributions were computed per LNP, baseline-subtracted and averaged across each group. Non-PBAE LANCE LNPs were used as the baseline. Attribution scores were normalized (max = 1) and averaged across ensemble models.
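The path integral behind IG can be illustrated on a toy function in plain Python. This sketch does not use Captum or COMET: the function `f`, its analytic gradient, the single all-zero reference point and the midpoint Riemann sum are all assumptions chosen to keep the example self-contained.

```python
def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # point on the straight path from the baseline to the input
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grads = grad_fn(point)
        totals = [t + g for t, g in zip(totals, grads)]
    # scale the averaged gradient by the input-minus-baseline difference
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


def f(v):
    """Toy model with an interaction between its two inputs."""
    return v[0] ** 2 + 3.0 * v[0] * v[1]


def grad_f(v):
    """Analytic gradient of f."""
    return [2.0 * v[0] + 3.0 * v[1], 3.0 * v[0]]


x, baseline = [1.0, 2.0], [0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
```

The attributions sum to `f(x) - f(baseline)`, the completeness property that makes IG scores comparable across inputs.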

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.