Protein sequence space is immense, and the protein/function landscape is rugged and barren: very similar sequences often have greatly different function with the majority of sequences lacking any utility. Protein complexity and our naivety of sequence/structure/function interplay hinder robust de novo design, although several designs have been successfully realized [3–5]. Thus, naïve identification of protein sequences with novel functions, or even mutants with improved function, benefits from combinatorial analysis of many proteins. The efficacy of this approach is directly dependent on combinatorial library quality and the phenotype selection process. The essence of discovery and evolutionary efficiency is to intelligently search sequence space by identifying the effective extent and amino acid distribution of diversity (if any) at each site. Protein discovery and evolution must balance variance sufficient for generation of novel function (dominated by intermolecular interactions) versus conservation sufficient to maintain a high probability of foldable stability (intramolecular interactions). This challenge is heightened in small domains that have limited area for a binder interface and require mutation of a larger fraction of the molecule.
Antibody repertoires have evolved sitewise amino acid distributions across a range of diversities (Fig 1A), which are used in natural and synthetic antibody libraries [10–14]. Yet most synthetic scaffold libraries–including affibodies, affitins, knottins, anticalins, Fynomers, Sso7d, and OBodies–are binary with a fully conserved framework and uniformly diversified paratope (Fig 1B). Note that different scaffolds use different uniform distributions including NNK codons or complementarity biases but generally lack sitewise variation. DARPin domain libraries have six sites with a uniform broad distribution and one site with N/H/Y diversity. A hydrophilic DARPin library includes two additional sitewise diversities. The most sitewise design in non-antibody scaffolds has been introduced in the type III fibronectin domain. Diversification of one, two, or three loops, or the sheet surface, of this 10 kDa beta sandwich has enabled evolution of binding to a host of molecular targets. Antibody-inspired amino acid bias in putative hot spots has proven effective within fibronectin libraries [31,33–35]. Diversification of two loops is evolutionarily superior to one-loop mutation, and although diversification of the third loop (DE loop: G52-T56) is not requisite for high-affinity binding, [30,36–39] it can aid stability. Current library designs randomize G52 with G/S/Y, S53-S55 with Y/S, and 12–22 sites in two other loops with a consistent distribution (30% Y, 15% S, 10% G, 5% each F and W, and 2.5% others except C). Sitewise design was extended beyond the DE loop using accessibility, stability, and homology data yielding nine different diversifications at 11 sites in addition to 12 sites with consistent complementarity-biased diversity. Also, in an alternative paratope approach to engineering fibronectin domains, five sites were identified for three different varieties of constrained diversities (4–8 amino acids) in addition to 12–19 sites with the complementarity biased diversity. While a variety of sitewise diversities have been implemented in the fibronectin domain, the evolved repertoires resulting from these libraries have not been broadly and deeply analyzed.
The current study aims to quantitatively evaluate the broad extents of diversification and sitewise amino acid distributions that evolve in hydrophilic fibronectin domains (Fn3HP) developed as binding ligands. The Fn3HP mutant was previously evolved for hydrophilicity to improve processing and in vivo biodistribution. We posit that a broad repertoire evolved from combinatorial libraries for de novo discovery will exhibit sitewise complementarity-biased amino acid diversity in the binding hot spot, conserved wild-type sequence in the distal framework, and a gradient of diversification at intermediate sites including bias for conservation or interactive neutrality in proximal regions. This gradient is not purely spatial as protein structure and protein-protein interfaces are complex (Fig 1C). Moreover, for novel ligand discovery, the exact paratope is not known ahead of time, which blurs designed localization of a hot spot.
The approach used was high-throughput discovery and directed evolution of thousands of binding ligands to various targets from a diverse combinatorial library followed by thorough sequencing of the library and binder populations to identify diversities and amino acids consistent with functional hydrophilic fibronectin domains (S1 Fig). Deep sequencing of evolved protein populations has proven effective for analysis of functionality landscapes for maturation of single protein clones, protein families, and antibody repertoires. Here we apply deep sequencing to several high-throughput discovery and evolutionary campaigns to identify repertoires of evolved hydrophilic fibronectin domains. The results demonstrate a range of diversities and sitewise amino acid preferences consistent with a benefit of a gradient of sitewise constraint. A constrained library based on the observed evolutionary repertoire provides stable, high affinity binders directly without maturation, and the sequence analysis provides a metric to evaluate the balance of inter- and intra-molecular considerations in library design, which are quantitatively assessed.
Oligonucleotides, including amino acid diversity and loop length variation, were synthesized by IDT DNA Technologies (Tables 1 and 2 and S1 and S2 Tables). Full-length Fn3HP amplicons were assembled by overlap extension PCR. The library of pooled diversified DNA was homologously recombined into a pCT yeast surface display vector within yeast strain EBY100 during electroporation transformation. The protocol was similar to that described by Benatuil et al.. Yeast at an optical density at 600 nM of 1.3–1.5 were washed twice with cold water and once with buffer E (1 M sorbitol, 1 mM CaCl2) and resuspended in 0.1 M lithium acetate, 10 mM Tris, 1 mM ethylenediaminetetraacetic acid, pH 7.5. Fresh dithiothreitol was added to 10 mM. Cells were incubated at 30°, 250 rpm for 30 minutes. Cells were washed thrice with cold buffer E and resuspended to 1.4 billion cells per 0.3 mL buffer E. Six μg of linearized pCT vector and 200 pmol of ethanol precipitated gene insert were added and transferred to a 2 mm cuvette. Cells were electroporated at 1.2 kV and 25 μF, diluted in YPD (10 g/L yeast extract, 20 g/L peptone, 20 g/L dextrose), and incubated at 30° for 1 hour. Cells were pelleted and resuspended in 100 mL SD-CAA (16.8 g/L sodium citrate dihydrate, 3.9 g/L citric acid, 20.0 g/L dextrose, 6.7 g/L yeast nitrogen base, 5.0 g/L casamino acids). Plasmid-containing yeast were quantified by dilution plating on SD-CAA agar plates.
Each resulting Fn3HP naïve yeast library was evaluated for proper library construction by Sanger sequencing clonal plasmids harvested from the transformed yeast (57 clones from generation one and 15 from generation two naïve libraries) and later via Illumina sequencing. The yeast libraries were also labeled with biotinylated anti-HA antibody (goat pAb, Genscript) and anti-c-MYC antibody (9E10, Covance Antibody Products;) to detect the presence of N- and C-terminal epitopes present on either side of the Fn3HP clones, respectively, via flow cytometry. The fractional detection of cells displaying both HA and c-MYC, compared to those displaying HA alone, is indicative of full-length, stop codon-free clones.
Binder Maturation and Evolution
The Fn3HP yeast library was grown in SD-CAA selection media for several doublings (about 20 h) in an incubator shaker at 30°C until an optical density value of 6.0 was reached, at which time the yeast were centrifuged and resuspended in SG-CAA induction media (10.2 g/L sodium phosphate dibasic heptahydrate, 8.6 g/L sodium phosphate monobasic monohydrate, 19.0 g/L galactose, 1.0 g/L dextrose, 6.7 g/L yeast nitrogen base, 5.0 g/L casamino acids) and grown overnight. The induced library was sorted twice via multivalent magnetic bead selections via depletion of non-specific binders on avidin-coated beads and control protein-coated beads followed by enrichment of specific binders on target-coated beads. The pair of magnetic sorts was followed by a flow cytometry selection for full-length clones using the 9E10 antibody against the C-terminal c-MYC epitope tag. Genes were mutated via error-prone PCR with loop shuffling, then electroporated into yeast (EBY100) as previously described. Target binding populations were isolated at two levels of stringency, mid- and high-affinity, for each of the four campaigns. A mid-affinity population included all clones that demonstrated either (i) magnetic bead sorting enrichment at least ten-fold greater for target protein than both avidin binding and non-specific control binding or (ii) binding to 50 nM multivalent target (3:1 stoichiometry of target preloaded on streptavidin-fluorophore) assayed via flow cytometry (S2 Fig). A high-affinity population included all clones exhibiting binding to 50 nM monovalent target assayed via flow cytometry. Herein, clones meeting these criteria are referred to as mid- and high- affinity binders, respectively. Flow cytometry was performed as previously described.
Illumina MiSeq Sample Preparation and Sequence Analysis
Plasmid DNA was isolated from yeast using Zymoprep Yeast Plasmid Miniprep II. DNA samples were divided into separate groups based on library generation of origin and binding affinity. Three categories were included for each generation: naïve clones from the initial libraries, mid-affinity binders collected via magnetic bead sorting, and high-affinity binders collected using FACS. In total, six pools of DNA were isolated and uniquely analyzed in association with generations one and two. Following plasmid DNA extraction, two rounds of PCR were completed to assemble the Fn3HP gene fragment with Illumina primers, index tags, multiplexing bar codes, and TruSeq universal adapter (S4 Table). For all PCR conducted during amplicon library preparation, KAPA HiFi polymerase was used as it has been shown to reduce clonal amplification bias due to GC content as well as fragment length bias. Compatible multiplexing and adapter primers were designed according to TruSeq sample preparation guidelines. Amplicons were pooled and supplemented with 25% PhiX control library to increase MiSeq read accuracy. Illumina MiSeq paired-end sequencing with 2 x 250 read length was conducted (University of Minnesota Genomics Center) to obtain 7.2x106 pass filter (PF) reads from the populations of interest, of which 90% of all pass filter bases were above Q30 quality metric (99.9% read accuracy).
Raw data generated through MiSeq consisted of forward and reverse read files (FASTQ) for each of the six multiplexed sublibraries. Assembly of paired end reads was done using PANDAseq. Assembled reads were analyzed using in-house Python code. Analysis work flow for each of the six subgroups (e.g. naïve, mid-affinity, high-affinity populations originating from first and second generation libraries) consisted of first identifying full-length fibronectin DNA sequences, isolating each of the three diversified loop regions, and, lastly, calculating the amino acid frequency at each site. Additional calculations were necessary for the mid-affinity and high-affinity populations to both remove statistically rare events and avoid overcounting dominant clones. The removal of background (i.e. the rarest 2% of sequences, as determined by the rarity of nonspecific binders) was a precaution taken when analyzing the mid-affinity populations to account for the rare non-binding clones inherently collected via magnetic bead sorting. To address the potential detriment of overcounting within all binding populations, the sequences for each loop region were clustered based on 80% or greater sequence homology. For each cluster of similar sequences, the summation of the amino acids at each site were weighted by a power of one-half, then aggregated across all clusters. The resulting weighted sitewise amino acid values were used for frequency calculations. Statistical analysis was performed using two sample Student’s t-test. Statistical significance was assessed while adjusting for familywise error rate using Bonferroni method, denoted at level α = 0.005.
High-affinity clones from three separate target binding campaigns of the current study were individually evaluated for stability using thermal denaturation midpoint, Tm, in the context of yeast surface display, as previously described. Wild-type Fn3HP and seven random clones from the second generation initial library were produced with a C-terminal six-histidine tag in BL21(DE3) and purified by immobilized metal affinity chromatography and reverse phase high performance liquid chromatography. Purified proteins (1 mg/mL in 2 mM 4-(2-Hydroxyethyl)piperazine-1-ethanesulfonic acid, 50 mM NaCl, 2 mM ethylenediaminetetraacetic acid, 1 mM dithiothreitol) were analyzed via circular dichroism using a JASCO J815 instrument. Measurements of molar ellipticity were taken at 218 nm while heating from 20–98°C at a rate of 1°C per minute. Stability measurements of 15 engineered fibronectin clones were retrieved from previously published studies wherein library design was implemented through a binary approach: broadly diversifying the anticipated paratope, using NNS and NNB codons, and fully conserving all other positions.
Amino Acid Frequencies in Natural Homologs
The Pfam database offers an extensive collection of protein families compiled through hidden Markov models. The Fn3 protein family homologs (PF00041) were aligned to the 101 amino acids in Fn3HP. To reduce the impact of dominant replicate sequences, while still accounting for their repeated observation, the frequency of replicate sequences were counted as the square root of the total number of occurrences. The amino acid frequency at each site was computed (S3B Table).
FoldX was used to determine the mutability of sites 23–31, 52–56, and 76–86 within the tenth type III domain of human fibronectin in the context of several structures cataloged in the Protein Data Bank (PDBs: 1FNA, 1TTG, 2OBG, 2OCF, 2QBW, 3CSB, 3CSG, 3K2M, 3QHT, 3RZW, 3UYO). After performing FoldX repair, random mutants were generated for each of the eleven structures by randomizing the BC, DE, and FG loop regions in accordance with the second generation diversity design scheme. At this point, baseline stabilities were individually calculated for each mutant. To analyze the stability impact upon residue substitution for each position in the diversified regions, all 19 natural residue substitutions were individually introduced to the random mutants. The change in stability (ΔΔGfolding) upon mutation was then calculated for each PDB structure’s collection of mutants. This process was iterated (n > 50) until the ΔΔGfolding associated with each position and residue converged to within 0.2 kcal/mol for at least five consecutive mutants. At each site, the stability impact upon substitution to each amino acid was calculated, creating stability matrices for each starting PDB. The sequences corresponding to the wild-type structures were aligned to account for loop length diversity. Average ΔΔGfolding values were calculated for all 20 amino acids at each diversified site (S3A Table).
The likelihood of a loop position to be proximal to or directly involved with a target binding interface is influenced both by exterior exposure of the side chain (i.e. solvent accessible surface area) as well as its proximity to a region offering sufficient diversified surface area to enable the required enthalpic interactions. The site-specific exposure score is calculated as the product of the solvent accessible surface area and an estimated likelihood of residing at the target binding interface. The latter was quantified on a sitewise basis, averaged across eleven Fn3 crystal structures, using a geometric algorithm. Using Python, the fibronectin orientation that presents maximal diversified surface is identified as follows. BC, DE, and FG loop residues are mutated to alanine to remove wild-type residue size bias. The area of accessible (i.e., visible in a two-dimensional projection as a planar approximation of the interface) diversified surface is calculated for each rotational orientation of fibronectin. The orientation that maximizes accessible view of the paratope, as well as any orientations within 5% of this projected area, is identified starting with a coarse-grained search and optimizing with a fine-grained search. For each site, the maximum area of accessible side chain surface within this set of optimized orientations is calculated. This calculation is repeated for all sites. Sitewise values are averaged across all Fn3 PDB models.
Library design and construction
As a collection of starting points for the evolution of diverse ligands, a combinatorial library was created with various levels of diversity throughout the potential paratope of the hydrophilic fibronectin loops (Table 1). Each loop varied in length as guided by natural sequence frequency. The core of the BC and FG loops– 2–5 sites and 2–6 sites, respectively, depending on loop length–had full amino acid diversity biased to mimic the third heavy chain complementarity-determining region (CDR) of antibodies. Two sites spatially within the BC and FG loop cores were constrained based on previous experimental results. V29, which benefits as a small, reasonably hydrophobic amino acid, was constrained as A, S, or T. G79, which benefits from glycine bias, was mildly constrained to G, S, Y, D, N, or C to increase glycine frequency while mimicking CDRs. Twelve sites adjacent to the core of the BC and FG loops were afforded five levels of diversity: i) wild-type, ii) wild-type or serine (as a small, mid-hydrophilicity neutral interactor), iii) wild-type, serine, or tyrosine (the most generally effective side chain for complementarity), iv) moderate chemical diversity (A, C, D, G, N, S, T, or Y), or v) full antibody-mimicking amino acid diversity. All framework sites are conserved as the sequence of the tenth type III domain of human fibronectin with the hydrophilic mutations V1S, V4S, V11T, A12N, T16N, L19T, V45S, and V66Q as well as the stabilizing D7N. Five sub-libraries were constructed using separate DNA oligonucleotides with degenerate codons for each level of diversity. The five sub-libraries for each of the three loops were pooled at equimolar levels.
The gene libraries were transformed into a yeast surface display system, which yielded 2.0x108 transformants. DNA sequencing of 57 random naïve clones indicated 61% had full-length sequences, 16% contained stop codons naturally arising from the CDR’ diversity, and 21% contained frameshifts. This finding was supported by flow cytometry analysis that revealed 64% of proteins were full-length as evaluated by the presence of a C-terminal c-myc epitope. Thus, the library contained 1.2x108 unique, full-length Fn3HP clones.
Shannon entropy (H = -∑i = 1–20 pi ln pi, where pi is the fraction of amino acid i at a particular site), which describes relative diversity ranging from 0 (fully conserved) to 4.3 (5% of each amino acid), was calculated at each site for the naïve and evolved populations to measure constraint within functional ligands. Diversity at 21 of 25 sites is more constrained in the evolved repertoire than in the unsorted library as evidenced by reductions in the Shannon entropies (average: -0.2 units; S3 Fig). Twelve sites have reduction of at least 0.15 units. Notably, G52 (2.8 to 2.0) and G31 (2.7 to 2.1) are largely driven by 22% and 17% increases in constraint of wild-type G. D23 (2.6 to 2.2), N54 (2.8 to 2.4), and T76 (2.8 to 2.4) are driven by broader constraint.
Even with wild-type constraint averaging 39% in the edges of the BC loop in the original library, wild-type was further enriched in evolved sequences by an average of 11%: D23 (46% to 55%), A24 (39% to 46%), P25 (29% to 40%), and G31 (40% to 56%). Conversely, at the sites in the middle of the loop, fully diversified with complementarity bias, wild-type enrichment was less frequent: absent at A26 (2% to 2%), V27 (<1% to <1%), and R30 (2% to <1%) but present at T28 (6% to 15%).
The FG loop exhibited increases in wild-type constraint throughout: T76, G77, R78, G79, D80, S81, P82, S84, and S85 all elevated wild-type with an average increase of 6%. A83 also increased but with very few sequences available for analysis because of loop length variability. Also, wild-type was not considered at site 86 to eliminate the large, charged side chain K.
In the peripheral DE loop, wild-type constraint was diminished at three of four sites: S53 (44% to 31%), S55 (44% to 28%), and T56 (29% to 27%). Yet at the other site, G52, wild-type was substantially enriched in evolved binders (35% to 57%).
Enrichments Predicted by Proteome or Fibronectin Homologs
The extent to which sitewise biases were correlative with wild-type or homologous residues was evaluated. Of 39 residues with an enrichment of at least 5% (in magnitude, not relative increase), 13 are wild-type, 13 are wild-type homologs (A26S, R30K, S55T, T76S, S84A, N86D, N86S, V29A, S53G, S53D, S55D, G79S, and S81G), and 13 are non-homologous as defined by proteome-wide amino acid homology (BLOSUM62).
Of the 26 residues with enrichment of non-wild-type amino acids, 14 (54%) are also enriched in fibronectin homologs (S3B Table). While these homolog enrichments support the concept of applying consensus design to combinatorial evolution, the other 12 residues (46%) highlight the limitations of such an approach (i.e., the functional capacity of mutations not observed in natural evolution).
Serine was frequent in the naïve library adjacent to the fully diversified sites because of its purported ability to act as a neutral interactor, providing neither substantial detriment nor benefit. It was mildly reduced (from 29% to 24%) in the evolved repertoire.
Sites with Complementarity Bias
On average across all sites with full, complementarity-biased diversity, the biased distribution–based on the distribution observed in natural antibody repertoires–was generally preserved. Modest exceptions were that A and D were enriched by 3% each whereas N and Y were depleted by 3% each.
Comparison of Evolutionary Probabilities from the Original and Evolved Repertoires
The original and evolved sitewise amino acid frequencies were compared for their likelihood to yield functional ligands. The repertoire evident in the evolved population is a more efficient starting point for ligand discovery. Using leave-one-out cross-validation–with family clusters partitioned to ensure unbiased training sets (see Materials and Methods for details)– 6.6-fold more clones were more likely to be identified from the constrained repertoire versus the original library (Fig 3A). The median increase in likelihood was 795%; i.e., an evolved clone was 8-fold more likely to be identified from a combinatorial library based on the new repertoire than the original library. Notably, this original library already had constraint built into it from previous studies, detailed above (particularly constraint to a small hydrophobic residue at 29 and glycine bias at 79).
The probability of evolution from the constrained population versus the original population was also evaluated on a sitewise basis (Fig 3B). Sites exhibit a range of evolutionary enhancements. For example, sites 52 and 54 have 107% and 213% increased likelihood of ligand discovery from the constrained repertoire whereas sites 78 and 81 are essentially neutral (1% and 2% increase towards constrained repertoire).
Loop Length Variation
The BC and DE loops frequently evolved loops one amino acid shorter than wild-type length but also exhibited frequent wild-type lengths (Fig 4). Extended loops were very rare in both BC (2%) and DE (0.1%). Also, a two amino acid deletion in the BC loop, while present in 33% of the original library, only appeared in 2% of evolved domains. Loop lengths were more broadly distributed in the FG loop but with a notable evolutionary preference towards shorter loops.
Library Design Based on First Generation Evolved Repertoire
The amino acid distributions observed in binding ligands evolved from the first generation library were used to design a second generation library, which enables further study of evolutionary repertoires from a more constrained library as well as evolution of useful ligands. Within the BC loop at site 23, wild-type D was enriched (46% in the naïve library to 55% in binders) whereas Y, introduced at 17% because of complementarity bias, was substantially depleted in binders to 8±1%. Small, mildly hydrophilic residues were also enriched–G (5% to 11%) and S (11% to 14%)–but not to levels comparable with wild-type bias. Thus, D23 was conserved in the second generation library. At site 24, wild-type A was elevated (39% to 46%), Y was maintained (13% to 13%), and S was substantially depleted (16% to 6%). Thus, the options of wild-type conservation and mild diversity were further explored in the second generation. At site 25, wild-type P was enriched (29% to 40%), Y was maintained (17% to 16%), S slightly declined (25% to 17%), and H was enriched (6% to 8%). Wild-type conservation appears beneficial yet Y, S, and H warrant further consideration. At site 29, the more hydrophobic A was enriched (42% to 56%) whereas the more hydrophilic S and T were depleted (22% to 12% and 34% to 26%, respectively) but still frequent. Thus, second generation design maintained a distribution of AST. Y31 was enriched from 12% to 17% in binders. Glycine, which occurs with 31% frequency at this site within natural sequences of homologous proteins, increased in prevalence from 40% to 56%. Substantial decreases in S (18% to 9%), N (8% to 3%), and C (7% to 2%) were observed. The second generation library contained GY diversity. Wild-type G52 was enriched (35% to 57%) whereas S was depleted (23% to 11%). Wild-type conservation appears strongly beneficial at site 52. At sites 53–55, Y and N were enriched or maintained whereas S was depleted, but still present at reasonable levels. Thus, the second generation library implemented YNST diversity at these sites. T56 exhibited similar results without S depletion leading to TYSN design. At FG loop site 76, enrichment was observed for both wild-type T (34% to 41%) and S (26% to 32%) whereas N (10% to 3%) and Y (8% to 3%) were depleted in binders. Thus, the next design included T and S as well as the other small mid-hydrophilic G and A. At site 79, wild-type G was enriched (22% to 31%) as was S (19% to 27%). D (19% to 20%) was maintained but C (13% to 5%), Y (11% to 5%), and N (15% to 9%) decreased. Thus, GSDN diversity was used in the second library. Wild-type S85 maintained (62% to 65%), which prompts future conservation. At site 86, N was mildly decreased (48% to 40%) while S increased from 23% to 28% and Y decreased from 10% to 5%. The second generation design was intended to be NS but was erroneously synthesized as conserved N.
At G77, wild-type was enriched from 5% to 9% in binders. Y remained frequent in binders (20% to 16%), and D (11% to 18%) and T (4% to 8%) were enriched. Thus, G77 was set to GSYADTNC in generation two. Though several enrichments and depletions are evident elsewhere, all other CDR’ sites will be maintained as CDR’.
The second generation library (Table 2) was constructed from degenerate oligonucleotides. 4.2x109 yeast transformants were obtained. 71% were full-length as assessed by cytometry and corroborated by Sanger sequencing (67% full-length).
Selection and Analysis
Mid- and high-affinity binders to MET, lysozyme, and rabbit IgG, as well as mid-affinity binders for tumor necrosis factor receptor superfamily member 10b, were evolved and sequenced using Illumina MiSeq with barcodes to identify mid- and high-affinity binders. Sequences were aligned, clustered, and counted, with accommodations to reduce overcounting of highly similar sequences, using an in-house algorithm. 4.8x105 sequences were collected with 2.3x105 identified as unique (Table 3). The sitewise differences between amino acid frequencies in the naïve library and selected binders were calculated at uniquely constrained sites (Fig 5A) and CDR’ sites (Fig 5B).
Sites 24 and 25 exhibit similar results in which significant wild-type conservation was not maintained in binders (52% to 16% of A24 and 55% to 24% of P25) and the other amino acid options were elevated fairly uniformly. At site 29, alanine was increased (37% to 56%) while threonine was depleted (43% to 28%). At site 31, wild-type tyrosine is substantially enriched (53% to 92%) at the expense of glycine (45% to 7%).
In the middle of the DE loop, sites 53–55, asparagines were depleted from their overly high starting points (36% to 29%, 41% to 23%, and 40% to 33%) while serines, which were more rare than designed in the naïve library, were increased (14 to 23%, 13% to 27%, and 13 to 17%). Y and T were essentially maintained thereby supporting the SYNT diversity when equally implemented. Asparagine was also decreased at site 56 (41% to 24%), but wild-type threonine was preferentially increased (19% to 30%).
At the edge of the FG loop, wild-type T76 was increased from 26% to 52% in binders while serine was decreased from 44% to 20%. At site 77, wild-type G (12% to 15%) and aspartic acid (12% to 16%) were enriched, serine (21% to 22%) and asparagine (22% to 22%) were maintained, and tyrosine (9% to 3%) and cysteine (7% to 2%) were depleted. At site 79, the GDSN diversity was consistently maintained in binders.
Sites with Complementarity Bias
In the fully diversified sites, the antibody-inspired diversity was maintained for many amino acids. Sitewise exceptions include wild-type conservation at D80 (9% to 22%), T28 (4% to 11%), and S81 (14% to 18%) and enrichment of isoleucine and leucine at site 30 (4% to 38%). Slight overall exceptions–decrease in cysteine (7% to 4%) and increases in hydrophobics isoleucine, leucine, and valine (sum 8% to 15%)–all compensate for imperfections in the degenerate codons, yielding frequencies more in line with natural antibody repertoires. The decrease of cysteine residues is driven by a lack of enrichment of single-cysteine clones more than depletion of dual-cysteine clones, which is perhaps suggestive of beneficial disulfide bond formation (Fig 6). Evaluation of cysteine pairs in dual-cysteine clones indicates a strong enrichment of clones with a cysteine in sites 26–28, especially 27, of the BC loop and 80–84 of the FG loop. Simultaneous cysteines at sites 76 and 84 are also frequently selected in binders.
Loop Length Variation
Loop length analysis indicated diverse lengths were used in binders with a preference for wild-type lengths, or one amino acid decreases, in the BC and DE loops and a broader distribution in the FG loop with preference for 1–3 residues less than wild-type (Fig 4). The longest FG loop, which is only observed in the tenth type III domain of human fibronectin but not the other fourteen human type III domains, is rarely observed (2%) in binders. The shortest FG loop was also less frequently observed in binders (13%) than in the unsorted library (31%).
The framework sites that were intended for conservation were also analyzed within the naïve and binder sequences to identify mutations, occurring during oligonucleotide synthesis, gene assembly, or directed evolution, that were preferentially present in binding clones. Four mutations were enriched (Table 4). Notably, the P44S mutation, and to a lesser extent S43F, are likely introduced by polyadenylation of the 3’ tail by Taq polymerase during error-prone PCR and then amplified by evolutionary selection.
The binders generated from the second generation library exhibit a range of sitewise amino acid frequency distributions that are not purely spatial (Fig 7A). Four sites exhibit Shannon entropies in excess of 3.5. Three additional sites are in excess of 3.0. Nine sites have entropies from 2.0–3.0. Six diversified sites exhibit Shannon entropies below 2.0. Four sites were conserved, as designed, based on first generation library analysis.
Binder phenotypic characterization
By constraining diversity at select sites, we aim to improve the balance of inter- and intra-molecular interaction evolution and reduce destabilization upon mutation. Thus, we evaluated the stability of several fibronectin mutant populations: binders from both library generations in this work and binders evolved from binary (fully conserved framework, fully diversified loops) libraries from previous literature as well as the naïve second generation population and the parental fibronectin domains (human and hydrophilic mutant) (Fig 8A). The first, second, and third quartile stabilities are higher for the first and second generation libraries relative to binders from less biased libraries. Further still, the median stability of the less biased library binders is less than even that of the naïve members of the second generation library (p < 0.05). Note that Fn3HP is of essentially equivalent stability as Fn3 (84°C vs. 85°C).
In addition to yielding stable binders, the second generation library yields high-affinity binders with little to no evolution. Three binder campaigns continued with additional sorts to identify the strongest binders in the population. Rabbit IgG and lysozyme binders were characterized following two rounds of magnetic bead selection, one round of cytometry sorting at target concentrations of 50 nM and a final round of cytometry sorting at 1 nM, wherein the top 1% of binding events were isolated. Titrations curves of representative clones from the most stringently sorted, non-evolved rabbit IgG and lysozyme populations (Fig 8B) revealed affinities of 150±60 pM and 4±3 nM, respectively. High-affinity MET binders were isolated following three iterations of evolution, each iteration consisting of two magnetic bead selections, one cytometry sort for full-length clones, and one round of error-prone PCR. A representative clone from the evolved MET binding population yielded an affinity of 11 nM (Fig 8B).
Library Design Principles
The high-throughput binder engineering and analysis described herein provides one means of identifying the extents of diversification, as well as the relevant amino acids, at each site. To further explore the broad sequence data set of highly functional clones resulting from this study, we examined if any computational means could have guided this library refinement; i.e. if any computable parameters correlate with the evolved repertoire. The FoldX algorithm was used to predict each site’s tolerance to mutation. The change in stability (∆∆Gfolding) upon mutation across >500 theoretical library variants (see Materials and Methods) was predicted for each of the twenty amino acids at each site. The mutational tolerance was assessed in terms of the number of minimally destabilizing (∆∆Gfolding < 0.75 kcal/mol) amino acid substitutions allowed at each site. Observed amino acid diversity (Shannon entropy) in evolved binders correlated with computational mutational tolerance at exposed sites (Fig 7B). While no correlation was observed for less exposed sites (Fig 7C), it should be noted that the key outliers are the sites most distant from the paratope center (D23 and S85).
More broadly, four elements were evaluated for their correlation with evolved repertoires: sitewise amino acid frequencies from natural fibronectin homologs, sitewise computational stabilities of each amino acid at each site within the context of diverse fibronectin clones, complementarity-biased amino acid distributions observed in antibody CDRs, and sitewise sidechain exposure. The first two elements–frequency in natural homologs and computational stability–provide sitewise amino acid distributions; complementarity bias provides a single site-independent amino acid distribution; and the fourth element–residue exposure–provides a sitewise weight. We examined the ability of these four elements to combine to generate sitewise amino acid distributions that matched the experimentally observed frequencies. The relative weights of the first three elements were allowed to vary for each site based purely on the fourth element: solvent and target accessibility of that site (Eq 1). Sitewise frequencies in natural homologs were calculated from 58,058 homologs from the Pfam database. Sitewise computational stabilities were calculated using FoldX as described above. Complementarity bias was calculated as the amino acid distribution observed in expressed human and mouse antibody CDR-H3 sequences. Solvent accessibility is the relative solvent accessible surface area of each side chain averaged over 11 fibronectin structures. Target accessibility was scored based on the orientation of the amino acid side chain relative to the rest of the diversified paratope (detailed in Materials and Methods).
The relative weights of homolog frequency, computational stability, and complementarity bias were computed that, when linearly combined, yield a sitewise amino acid distribution that is most consistent with the evolved distribution. Weights were calculated for both an exposure-dependent and exposure-independent term, which were summed. The weights provide a relative comparison of the predictive value of each parameter. To evaluate the predictive value overall, the parameters were compared to an unbiased control input matrix (uniformly 5% of each amino acid). Weighted inclusion of all elements yields a library design that matches the experimentally evolved distribution 22 standard deviations superior to designs based on unbiased input matrices (S3 Fig). For well-exposed sites, amino acid diversity is effectively mimicked by 62% complementarity bias, 30% stability computation, and 8% natural frequency (Fig 9). At sites with less exposure, natural amino acid frequency should be more heavily weighted at the expense of stability and, less so, complementarity. Design based on a single element is inferior to randomness for stability, marginally effective (2 standard deviations above random) for natural frequency, and strongly effective (16 standard deviations above random) for complementarity. In short, the evolved repertoire correlates strongly with complementarity and moderately with computed stability and frequency in natural homologs, varying dependent upon sidechain exposure.
In the pursuit of a broadly functional combinatorial library capable of yielding binders to numerous targets, the benefit of diversification is unclear for sites peripheral to a ‘hot spot’ that enthalpically drives high-affinity binding. Moreover, the location of the hot spot can differ across different epitopes being targeted by a scaffold. These peripheral sites can (a) directly contact target, (b,c) impact neighboring residue orientation to improve interfacial enthalpy or reduce entropic penalty upon binding, and/or (d) stabilize the protein. Yet these potential benefits can be offset by the inverse impacts: make unfavorable interfacial contact, worsen neighboring residue orientation, and/or destabilize the protein. If sufficient ‘hot spot’ interfacial area is not yet present for high-affinity binding, then additional sites must be diversified to enable favorable interaction. At some point, this expanded paratope provides sufficient interface for strong, specific affinity. Similar tradeoffs can be considered for peripheral sites. Given the typical detriment of random mutation, [88–90] the average peripheral mutation will hinder all four elements thereby suggesting against diversity. Though as a corollary, on average, mutations in the ‘hot spot’ will negatively impact the last two elements by worsening the entropic penalty upon binding and destabilizing the protein because of imperfect interactions with the conserved peripherals. Thus, peripherals need to be chosen to make neutral to good contact with: (a) intermolecular target; (b,c) intramolecular neighbors involved in binding; and (d) all intramolecular neighbors. For a-c, since beneficial interactions will be unlikely, amino acids–such as serine–that yield relatively neutral interactions may be wise. For d, beneficial interactions are likely for the wild-type residue and conserved neighbors based on their coevolution so conservation should be the aim. Since the precise locations of the hot spots and these transitions will vary for each new ligand-target interface (Fig 1C), we hypothesized the evolved repertoires will exhibit a gradient of diversity from extensive diversity in the potential paratope hot spot to full conservation in the framework. Importantly, this gradient includes moderate diversity, with structural bias, within the paratope interfacing with target yet peripheral to the hot spot. Moreover, more mild diversity is included adjacent to the interfacial residues to yield optimal intramolecular contacts with the newly identified paratope. The range of Shannon entropies (Fig 7) and amino acid frequency distributions (Figs 2 and 5) across many binding sequences against several targets support the benefit of a gradient of diversities within combinatorial libraries. The particular amino acid distributions support the hypothesized benefits of wild-type conservation, serine bias, and complementarity bias, at appropriate sites.
Sitewise optimization of this gradient between intra- and inter-molecular interaction biases can be achieved with broad, high-throughput binder generation and deep sequencing as demonstrated here. Yet this requires a sufficiently effective library to generate numerous binders, which may be difficult for new scaffolds or paratopes. Analysis conducted in the current study (Fig 9) provides further evidence that initial combinatorial library design can be guided by complementarity-determining residues and, when available, natural homolog frequencies, stability data (theoretical or experimental), and side chain exposure to solvent and target. Ongoing studies can quantify the values of complementarity, stability, natural sequence frequency, exposure–and potentially other metrics–in the context of other protein topologies and other functions, such as catalysis.
Sitewise optimization of amino acid frequency, with a range of diversities, can be implemented in numerous ways. Trimer phosphoramidite codons can be used in oligonucleotide synthesis, which enables precise control over each distribution but elevates synthesis complexity and cost. Independent oligonucleotides can be synthesized for each loop sequence, which further elevates control by enabling pairwise (and higher order) site design albeit at an elevated synthesis scope. Simpler, less expensive single-nucleotide mixed degenerate oligonucleotide synthesis can approximate many amino acid distributions, especially with the inclusion of unbalanced nucleotide frequencies as used in this study. The amino acid distribution within antibody CDR-H3 can be closely approximated by unbalanced single-nucleotide methods, but it must compromise on the genetic code connectivity of glycine, tyrosine, and cysteine. Achieving the desired high frequencies of tyrosine (20%) and glycine (16%) yields much more cysteine than desired (10%). In these libraries, we opted to maintain high tyrosine (17%) while limiting cysteine (5%) at the expense of low glycine (4%). Wild-type glycine bias at sites G52 and G79, as well as G77 in the second generation library, enabled this successful compromise as glycine at fully diversified sites was only marginally enriched in binders relative to the original library (2.7% to 4.0% in the first generation; 3.8% to 4.7% in the second generation). Thus, selective sitewise bias is able to effectively provide the evolutionary benefit of the presence of glycine within the DE and FG loops.
Note that the sublibrary synthesis approach in generation one (Table 1) yields coupling between sites within each loop. For example, wild-type D23 conservation pulls wild-type conservation in other BC sites during generation one analysis. In the absence of this coupling in generation two analysis, wild-type conservation at other BC sites (A24, P25, and Y31) is reduced. In the DE loop, wild-type G52 conservation pulls N54 conservation, which converts to N54 depletion in generation two in the absence of G52 coupling. Thus, when evaluating a new scaffold or paratope design, sublibrary construction enables analysis of numerous diversification strategies, but care must be taken to consider coupled sites.
While cysteines were overall depleted from binding sequences relative to the naïve library, select inter- and intra-loop cysteine pairs were enriched. These occurred at proximal locations that are structurally sensible for disulfide bond formation, but further validation is needed to confirm the existence of disulfide bonding. Enhanced evolutionary efficiency of this class of clones warrants consideration of biased design to drive the conformational restriction beneficial to numerous topologies including stapled helical peptides, shark new antigen receptors, camelid antibody domains, and previous fibronectin clones. Yet, while entropically beneficial, this conformational restriction may limit the diversity of paratopes that a library can present. Moreover, it eliminates the benefits of cysteine-free ligands: intracellular use, efficient cytoplasmic production in bacteria, and genetically introduced cysteines for site-specific thiol chemistry.
In conclusion, the extent of diversity and particular amino acid distributions consistent with a broad capacity for evolution of new binding activity were determined for a combinatorial library of a hydrophilic fibronectin domain. A gradient of diversity including sitewise constraints was revealed in evolved clones, which is consistent with natural antibody repertoires but converse to current synthetic scaffold combinatorial library designs. Importantly, the extensive dataset allowed for initial characterization of a broadly applicable data driven library design model that guides the most beneficial distribution of amino acids at each position.