Generative model enhances enzyme activity and stability

Enzymes are highly efficient catalysts shaped by natural evolution, with fast turnover rates and precise specificity. Improving enzyme catalytic activity and stability is crucial in many biological, medical, and industrial applications. Various computational methods have been used to study the performance of enzyme variants, but accurately simulating the effect of mutations on catalytic ability remains a major challenge. Machine learning offers a new strategy for engineering enzymes, and one novel approach is to analyze evolutionarily related protein sequences with generative models. Generative models learn the probability distribution of protein sequences that arose during natural evolution, and the probability they assign to a specific variant correlates with its measured effect in deep mutational scanning experiments. This correlation indicates that generative models capture the fitness of mutants during protein evolution, and therefore have great potential for exploring the functional sequence space of enzymes.

On November 20, 2023, the group of Professor Arieh Warshel (2013 Nobel laureate in Chemistry and member of the National Academy of Sciences) of the Department of Chemistry at the University of Southern California, USA, published a research paper titled “Enhancing Luciferase Activity and Stability Through Generative Modeling of Natural Enzyme Sequences” in the Proceedings of the National Academy of Sciences (PNAS). The team used a generative maximum entropy (MaxEnt) model to analyze homologs of the luciferase RLuc and, in combination with biochemical experiments, showed that information from natural evolution can be used to design both the enzyme active center and the protein scaffold, effectively improving enzyme activity and stability. The researchers used evolutionary information to guide the engineering of the luciferase, achieving a success rate of approximately 50% in improving its activity or stability with single point mutations, and revealed the evolutionary preference of RLuc for emitting blue light. The finding highlights how nature evolves efficient enzymes: overall performance is improved by applying different evolutionary pressures to different regions of the enzyme.

 

Figure 1. Generative model MaxEnt for luciferase RLuc

By analyzing evolutionarily related protein sequences, a generative model provides a probabilistic description of the effect of mutations during natural evolution; among such models, the information-theory-based MaxEnt model has strong generative ability. The researchers applied the MaxEnt model to the engineering of luciferase. Using RLuc (UniProt ID: P27652) as the target sequence, they searched the UniRef90 database to construct a multiple sequence alignment (MSA) with a length-normalized bit score threshold of 0.7, obtaining a total of 1775 homologous sequences. The model parameters were then optimized against statistics computed from the MSA, namely the probability of each amino acid at a single residue position and the probability of each amino acid pair at two residue positions. After parameterization, the statistical energy E(S) was calculated for each sequence or variant and compared with existing biochemical data on luciferase. This comparison helps to elucidate the connection between evolution and catalysis, and can reveal the strategies that nature uses to evolve luciferases.
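The two kinds of MSA statistics and the statistical energy described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the common pairwise (Potts-like) form E(S) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j), and the toy alignment, the parameter values, and the names `site_frequencies`, `pair_frequencies`, `statistical_energy`, `h`, and `J` are all hypothetical. The fitting step that turns the frequencies into parameters is omitted.

```python
from collections import Counter
from itertools import combinations


def site_frequencies(msa):
    """Single-site amino acid frequencies f_i(a) for each alignment column."""
    n = len(msa)
    length = len(msa[0])
    return [
        {aa: count / n for aa, count in Counter(seq[i] for seq in msa).items()}
        for i in range(length)
    ]


def pair_frequencies(msa):
    """Pairwise frequencies f_ij(a, b) for every pair of columns i < j."""
    n = len(msa)
    length = len(msa[0])
    return {
        (i, j): {
            pair: count / n
            for pair, count in Counter((seq[i], seq[j]) for seq in msa).items()
        }
        for i, j in combinations(range(length), 2)
    }


def statistical_energy(seq, h, J):
    """Assumed form: E(S) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j).

    Parameters that are not listed in h or J are treated as zero.
    """
    field = sum(h[i].get(aa, 0.0) for i, aa in enumerate(seq))
    coupling = sum(
        J.get((i, j), {}).get((seq[i], seq[j]), 0.0)
        for i, j in combinations(range(len(seq)), 2)
    )
    return -(field + coupling)


# Toy alignment (hypothetical sequences, not RLuc homologs).
toy_msa = ["ACD", "ACD", "AGD", "TCE"]
f1 = site_frequencies(toy_msa)
f2 = pair_frequencies(toy_msa)

# Hypothetical fitted parameters, for illustration only.
h = [{"A": 1.0, "T": -0.5}, {"C": 0.8, "G": 0.1}, {"D": 0.6, "E": -0.2}]
J = {(0, 1): {("A", "C"): 0.3}, (1, 2): {("C", "D"): 0.2}}

print(f1[0])
print(statistical_energy("ACD", h, J))  # common residues: lower energy
print(statistical_energy("TGE", h, J))  # rare residues: higher energy
```

Under this sign convention a lower E(S) corresponds to a sequence the model considers more probable, which matches the later selection of variants with the lowest E(S) values.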

Figure 2. Relationship between statistical energy E(S) and biological activity of luciferase RLuc

The team studied the correlation between luciferase activity and the statistical energy E(S), and identified a total of 26 variants using consensus design. For each variant, the average distance between the substrate and the mutated residues was determined, and the variant was classified as an active-center or enzyme-scaffold variant using a cutoff of 8.5 Å. For variants classified as active center, activity shows a clear negative correlation with E(S), with a Pearson correlation coefficient of -0.69 (P value=0.057) (Figure 2A). For variants with mutations in the enzyme scaffold, the correlation is much weaker, with a Pearson correlation coefficient of -0.25 (P value=0.29). The team further investigated whether natural evolutionary data could also reveal the bioluminescent properties of luciferase. For variants with mutations in the active center, the correlation between E(S) and the emission peak wavelength is 0.51 (P value=0.24), while the correlation in the enzyme scaffold region is smaller, at 0.22 (P value=0.38) (Figure 2B). These results indicate that luciferase activity correlates with natural evolutionary information, especially at the active center, and that E(S), as an evolutionary measure, can help distinguish active from inactive enzymes.
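The analysis in this paragraph, splitting variants by distance to the substrate and computing a Pearson correlation within each group, can be sketched as follows. The variant records are made-up numbers chosen only to make the arithmetic easy to follow, the names `classify`, `pearson`, and `correlation_by_region` are hypothetical, and treating a distance of exactly 8.5 Å as active center is an assumption.

```python
import math

CUTOFF_ANGSTROM = 8.5  # cutoff quoted in the text


def classify(distance):
    """Label a variant by its mean substrate-to-mutated-residue distance.

    Assumption: a distance exactly at the cutoff counts as active center.
    """
    return "active_center" if distance <= CUTOFF_ANGSTROM else "scaffold"


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def correlation_by_region(variants):
    """Group (name, distance, energy, activity) records and correlate
    energy with activity separately within each region."""
    groups = {}
    for _name, distance, energy, activity in variants:
        xs, ys = groups.setdefault(classify(distance), ([], []))
        xs.append(energy)
        ys.append(activity)
    return {region: pearson(xs, ys) for region, (xs, ys) in groups.items()}


# Made-up records: (name, distance in angstroms, E(S), relative activity).
toy_variants = [
    ("v1", 4.0, -2.0, 4.0),
    ("v2", 6.5, -1.0, 3.0),
    ("v3", 7.0, 0.0, 2.0),
    ("v4", 8.0, 1.0, 1.0),
    ("v5", 16.0, -2.0, 2.0),
    ("v6", 20.0, -1.0, 3.0),
    ("v7", 18.0, 0.0, 2.0),
    ("v8", 25.0, 1.0, 3.0),
]

result = correlation_by_region(toy_variants)
for region, r in result.items():
    print(region, round(r, 2))
```

This sketch reports only the coefficient. A P value like those quoted above needs a significance test on top, for example the one returned by `scipy.stats.pearsonr`.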

Figure 3. Characteristics of the designed luciferase RLuc mutants

The researchers used the E(S) energy landscape to identify low-energy sequences with the potential to improve enzyme performance, obtaining a total of 220 variants with mutated residues within 8.5 Å of the substrate and 394 variants with mutated residues more than 15.0 Å from the substrate. A variational autoencoder (VAE) was then used to build a low-dimensional embedding that characterizes the relationship between the generated variants and natural luciferase sequences. The VAE was trained on the natural sequences with a two-dimensional latent space. Natural sequences are organized into distinct peaks in the latent space (Figure 3A), which may be related to phylogeny. Variants created by redesigning the active center or the enzyme scaffold are confined to specific local regions of the latent space (Figures 3B and C). The active-center and scaffold variants occupy different regions, each close to a different peak. The mutated residues in the active center and the enzyme scaffold that were tested experimentally are highlighted in Figures 3D and E.
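One simple way to search an energy landscape for low-energy variants is to enumerate single point mutations at chosen positions and rank them by their change in E(S). The sketch below illustrates that idea only; the source does not describe the authors' search procedure at this level of detail. The field-only toy energy, the three-residue sequence, and the names `rank_single_mutants`, `toy_fields`, and `toy_energy` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def rank_single_mutants(wild_type, energy, positions, top_n=8):
    """Return the top_n single mutants with the lowest energy change.

    `energy` is any callable mapping a sequence to E(S). `positions` are
    0-based indices, for example residues within a distance cutoff of the
    substrate. Labels use 1-based numbering such as 'A1G'.
    """
    e_wild_type = energy(wild_type)
    candidates = []
    for pos in positions:
        for aa in AMINO_ACIDS:
            if aa == wild_type[pos]:
                continue  # skip the wild-type residue itself
            mutant = wild_type[:pos] + aa + wild_type[pos + 1:]
            label = f"{wild_type[pos]}{pos + 1}{aa}"
            candidates.append((label, energy(mutant) - e_wild_type))
    candidates.sort(key=lambda item: item[1])
    return candidates[:top_n]


# Hypothetical field-only energy model on a toy three-residue sequence.
toy_fields = [{"A": 1.0, "G": 1.5}, {"C": 0.5, "S": 0.9}, {"D": 0.2}]


def toy_energy(seq):
    """E(S) = -sum_i h_i(s_i), with unlisted residues contributing zero."""
    return -sum(toy_fields[i].get(aa, 0.0) for i, aa in enumerate(seq))


top = rank_single_mutants("ACD", toy_energy, positions=[0, 1], top_n=2)
for label, delta_e in top:
    print(label, round(delta_e, 2))
```

Passing the energy function as a callable keeps the scan independent of the model: the same loop works with a pairwise energy, and restricting `positions` to residues near or far from the substrate gives separate active-center and scaffold candidate lists.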

Figure 4. Activity and stability experiments on luciferase RLuc mutants

In subsequent experiments, the researchers selected the eight single mutants with the lowest E(S) values in the active center for experimental characterization. About 50% of these mutations were beneficial (Figure 4A), while the other half had lower activity than the wild type, showing that mutations in the active center have a substantial impact on catalysis. The researchers also selected the eight mutants with the lowest E(S) values in the enzyme scaffold. Of these, the Y298A and E195T mutants showed significantly lower protein production, so only the remaining six mutants were characterized experimentally. The results indicate that mutations in the enzyme scaffold can significantly affect the stability of RLuc, with changes in melting temperature (Tm) ranging from -5.0 °C to +6.0 °C (Figure 4D). This suggests that the main evolutionary constraint on the enzyme scaffold is to maintain stability rather than to improve catalytic efficiency.

Table 1. Activity of mutants designed by generative artificial intelligence

Generative artificial intelligence models trained on natural homologous sequences have shown potential for generating functional sequences similar to those found in nature. Recent studies have used a variety of such models to generate functional enzyme sequences (Table 1). Despite these advances, engineering enzymes that outperform their wild-type counterparts remains a significant challenge.

In summary, this study used a generative maximum entropy (MaxEnt) model to design mutations derived from the sequence diversity observed in nature. Mutations in the active center successfully increased enzyme activity, and mutations in the protein scaffold improved stability. This engineering success highlights the potential of generative artificial intelligence in enzyme design, strengthens the connection between protein sequence evolution and catalysis, offers a guide for exploring a broader enzyme sequence space, and provides an effective strategy for computer-aided enzyme engineering.