The Contributions of the Allopolyploid Parents of the Mesopolyploid Brassiceae are Evolutionarily Distinct but Functionally Compatible

The members of the tribe Brassiceae share an ancient whole genome triplication (WGT), and plants in this tribe display extraordinarily high within-species morphological diversity. One proposed model for the formation of these hexaploid Brassiceae is that they result from a “two-step” pair of hybridizations. However, direct evidence supporting this model of formation has been lacking; meanwhile, the evolutionary and functional constraints that drove evolution after the hexaploidy are even less understood. Here we report a new genome sequence of Crambe hispanica, a species sister to most sequenced Brassiceae. After adding this new genome to three others that are also descended from the ancient hexaploidy, we traced the history of gene loss after the WGT using a phylogenomic pipeline called POInT (the Polyploidy Orthology Inference Tool). This approach allowed us to confirm the two-step model of hexaploidy formation and to assign statistical confidence to our parental “subgenome” assignments for >90,000 individual genes. We show that each subgenome has a statistically distinguishable rate of homeolog losses. Moreover, our modeling allowed us to infer that there was a significant temporal gap between the two allopolyploidizations, with about one third of the total shared gene losses between the four analyzed Brassiceae species in the first two subgenomes prior to the arrival of the third subgenome. There is little indication of functional distinction between the three subgenomes: the individual subgenomes show no patterns of functional enrichment, no excess of shared protein-protein or metabolic interactions between their members, and no biases in their likelihood of having experienced a recent selective sweep. We propose a “mix and match” model of allopolyploidy, where subgenome origin drives homeolog loss propensities but where genes from different subgenomes function together without difficulty.


Introduction 1
Fifty years ago, Ohno (Ohno 1970) published his famous, and forceful, opus on the role of gene 2 duplication, and in particular of genome duplication (aka polyploidy), in evolutionary innovation. Since  The tribe Brassiceae experienced hexaploidy between 5 and 9 MYA and after its divergence from 11 Arabidopsis thaliana ). This polyploidy was originally inferred by comparative linkage subgenomes to be more highly expressed: genes from the remaining subgenomes have a correspondingly 26 greater chance of being silenced or even lost completely, suggesting one mechanism that might drive 1 preferential retention. This phenomenon has been called "subgenome dominance", and the resulting 2 pattern of gene retention is known as "biased fractionation" (Thomas et al. 2006; Schnable et al. 2011;3 Yoo et al. 2014). A number of hypothesis have been proposed to explain these differences in homeolog 4 expression, including differences in transposon silencing across subgenomes ;

Results 1
A well-assembled, annotated genome of Crambe hispanica 2 Our assembly of the Crambe hispanica genome contains 1,019 contigs with 11 terminal 3 telomeres and has a total size of 480MB; its contig N50 size is 4.4 Mb. The assembly graph showed low 4 heterozygosity and few assembly artefacts, with the exception of one mega-cluster consisting of a high 5 copy number LTR across 500 contigs and spanning ~30 Mb. The resulting annotated genome is of high 6 quality: we compared its gene set against the Benchmarking Universal Single-Copy Orthologs (BUSCO Using the genome of Arabidopsis thaliana as an unduplicated reference, we inferred blocks of 13 triple conserved synteny (TCS) for each of four genomes sharing the Brassica hexaploidy. We then 14 merged these blocks across all of the species: we refer to each such locus as a "pillar." Each pillar consists 15 of between 1 and 3 surviving genes in each of the four genomes. As described in the Methods, we used 16 both a set of TCS blocks inferred with POInT containing 14,050 pillars (Ppillars) and a separate ancestral 17 genome reconstruction that estimates the gene order that existed just prior to the WGT. The latter contains 18 five reconstructed ancestral chromosomes involving 89 scaffolds with a total of 10,868 ancestral genes. 19 When we match these genes to the TCS blocks computed with POInT, the result is 7,993 ancestrally-20 ordered pillars (Apillars). After the WGT, each ancestral locus could potentially expand to three gene copies, but because of the 2 biased gene loss, the number of genes left in each subgenome is not even. One subgenome is less 3 fractionated (LF), while the other two are more fractionated (MF1 and MF2). Shown here is a window of 4 15 post-WGT loci (out of the total 14,050 loci) in four species that shared the WGT, Brassica rapa, 5 Brassica oleracea, Crambe hispanica and Sinapis alba. Each pillar corresponds to an ancestral locus, and 6 the boxes represent genes connected by synteny (straight line). The numbers on top of each pillar are the 7 posterior probabilities assigned to this combination of orthology relationships relative to the other (3!) 4 -8 1=1295 possible orthology states. The numbers above each branch given the number of genes in each 9 subgenome surviving to that point, with the number of gene losses in parentheses. The numbers below the 10 branches in the first subtree are POInT's branch length estimates (αt). 11 12 13 Inferring the relationship of the four Brassiceae genomes based on gene loss patterns 14 We fit models of WGT evolution (see below) to both several different orderings of the 14,050 15 pillars in the Ppillars set and to the Apillars (Supplementary Table S1). These orderings of the Ppillars differed 16 in their number synteny breaks: we used the ordering with the highest likelihood under the WGT 3rate 17 G1Dom model for our remaining analyses (see below). Similarly, we compared the fit of three possible 18 phylogenetic topologies to the pillars under this model: the remainder of our analyses use the topology 19 shown in Figure 1, which has the highest likelihood. Curiously, one of the other two topologies, while 20 having a lower likelihood under POInT's models (Supplementary Figure S1), is the phylogeny estimated 21 using the plastid genome (Arias and Pires 2012). Because the Apillars give similar parameter estimates but 22 comprise a smaller dataset, we will discuss our results in terms of the Ppillars. 23 24 The three subgenomes differ in their propensity for ohnolog copy loss 1 We used POInT to assign genes from the four triplicated genomes to three subgenomes with high 2 resolution: 75% of the pillars have subgenome assignments with posterior probabilities > 0.84 3 (Supplementary Figure S3). We observe clear signals of biased fractionation: while we estimate that  We assessed the statistical support for these estimated differences in the subgenomes' rates of 8 ohnolog loss using a set of nested models of post-WGT gene loss. We started with a model (WGT Null) 9 that did not differentiate between the subgenomes, meaning that the shared base transition rate from T to 10 D1,2, D1,3 or D2,3 was defined to be α (0 ≤ α < ∞, Figure 2). The transition rate from D1,2, D1,3 or D2,3 to S1, 11 S2 or S3 is scaled by σ: e.g., occurs at rate α•σ. We compared this model to a more complex one that 12 allowed losses of both triplicated and duplicated genes to be less frequent from a posited less-fractionated 13 subgenome LF (WGT 1Dom, Figure 2). This model introduces a fractionation parameter f1 (0 ≤ f1 ≤ 1), 14 which potentially makes the transitions between T and D2,3 rarer than the other T-to-D rates (α•f1; see 15 Figure 2). The WGT 1Dom model fits the pillar data significantly better than does WGT Null ( Figure 2; 16 P<10 -10 , likelihood ratio test with two degrees of freedom). We next compared the WGT 1Dom model to 17 a WGT 1DomG3 model that gives MF1 and MF2 separate loss rates. Again, this model gives a better fit to 18 the pillar data than did WGT 1Dom (P<10 -10 , likelihood ratio test with two degrees of freedom, Figure 2). 19 We hence confirm the presence of three subgenomes, distinguishable by their patterns of homeolog loss. 20 It is important to recall here that our approach does not require the identification of these three 21 subgenomes a priori: the probabilistic assignment of genes to subgenomes is an integral part of the 22 POInT orthology computation: as a result, the inherent uncertainty in these assignments is accounted for 23 in estimating the various biased fractionation parameters. and yellow circles are three single-copy states (S1 for LF subgenome, S2 for MF1 subgenome and S3 for 4 MF2 subgenome). The transition rates between states are shown above the arrows. α: transition rate from 5 triplicated state to duplicated states; ασ: transition rates from duplicated states to single copy states; f: fractionation parameter f1 (0  f1  1), the MF1 and MF2 subgenomes are more fractionated than LF 10 subgenome. WGT 1DomG3 model: two fractionation parameters f1,3 and f2,3 were introduced, 11 distinguishing three subgenomes: MF2 is more fractionated than MF1, and MF1 is more fractionated than (Root-spec. WGT 1DomG3 in Figure 2). This extended model fits the pillar data significantly better than 24 does the original WGT 1DomG3 model (P<10 -10 , likelihood ratio test with five degrees of freedom, Figure  25 2). The biased fractionation parameters for the root branch differ from those of the remaining branches: 26 the value of f1,3 on the root is smaller than on later branches (0.6445 versus 0.7368) while f2,3 is larger 1 (0.6766 versus 0.4078). These values are consistent with a two-step hypothesis: prior to the arrival of LF, 2 there would have been a number of losses from MF1 and MF2, meaning that the relative preference for 3 LF would be higher (smaller f1,3). 4 In our second approach, we developed a specific model of the two-step hexaploidy (WGT 5 1DomG3+RootLF in Figure 2). This model describes the transition from a genome duplication to a 6 triplication: all pillars start in state D2,3: e.g., the MF1 and MF2 genes are present but not the LF ones. We 7 then model the addition of LF as transitions to either the T, D1,2 or the D1,3 states (with rates τ, β1,2 or β1,3, 8 respectively). State T is seen when no losses occurred prior to the arrival of LF, the other states occur 9 when either MF1 or MF2 experienced a loss prior to the arrival of LF. Any pillars that remain in D2,3 had 10 no corresponding gene arrive from LF. Of course, at the level of the individual pillar, we have insufficient 11 data to make such inferences: the utility of this model is to give global estimates of the degree of 12 fractionation seen in MF1 and MF2 prior to the arrival of LF. This model offers a significantly improved 13 fit over WGT 1DomG3 (P<10 -10 , likelihood ratio test with three degrees of freedom, Figure 2). More 14 importantly, we can propose other versions of this model where either MF1 or MF2 is the last arriving 15 subgenome: when we do so, the model fit is much worse than seen with WGT 1DomG3+RootLF model 16 (Supplementary Table S1). Hence, we can conclude that subgenomes MF1 and MF2 had already begun a 17 process of (biased) fractionation prior to the addition of the LF subgenome. 18

A gap between the two allopolyploidies 20
This root-specific model also allows us to estimate the state of MF1 and MF2 immediately before 21 the arrival of LF. In particular, we can estimate the percentage of pillars that had already experienced 22 losses prior to LF's arrival. About 28% of all of the MF1 homeologs inferred to have been lost on the root 23 branch were lost prior to the arrival of LF, with the equivalent number of MF2 losses being 38%. A 24 negligible 0.3% of pillars do not appear to have received a copy of the LF homeolog. in RNA interference processes, suggesting that such interference, targeted to the MF1 and MF2 17 subgenomes, could be one mechanism by which biased fractionation was driven. 18

19
Genes from the same subgenome are not overly likely to physically or metabolically interact 20 For genes with high subgenome assignment confidence (  95%), we mapped those assignments 21 (LF, MF1 or MF2) and the duplication status at the end of the root branch onto the nodes (gene products) 22 of the A. thaliana protein-protein interaction (PPI) network (Methods). For comparative purposes, we also 23 produced a mapping of an extant network, based on the gene presence/absence data and subgenome 24 assignments in B. rapa. Not surprisingly, in the "ancient" network inferred at the end of the common root 25 branch, there are a relatively large number of nodes (1,952) associated with surviving triplicated loci: 1 these nodes were connected by a total of 2,384 triplet-to-triplet edges. The B. rapa-specific network 2 contains fewer nodes with retained triplets (662): there were 263 edges connecting these nodes ( Figure  3 3a).   Conant et al. 2014). Thus, we expected to see that the retained triplets showed higher 3 network connectivity. And indeed, our permutation tests reveal that the retained triplets on the root branch 4 are significantly over-connected to each other in the PPI network (P = 0.018, Supplementary Figure S6). 5 We also hypothesized that proteins coded for from the same subgenome would be more likely to be 6 connected due to preferential retention of genes from a single complex from the same subgenome. To test 7 this idea, we partitioned the gene products based on their subgenome of origin. The LF subgenome 8 contains more genes and thus more exclusive connections (Figure 3b). When considering only genes that 9 had returned to single-copy by the end of the root, we identified 188 LF-LF edges among 886 single copy 10 LF genes, with fewer edges exclusive to MF1 and MF2 genes (30 and 3, respectively). We used 11 randomization methods to test whether the numbers of such subgenome-specific edges differed from what 12 would be expected by chance. When considering the network as a whole, we found that there were 13 significantly fewer LF-LF edges than expected (P = 0.022; Supplementary Figure S6). However, when we 14 considered only the single-copy genes in the network, the number of subgenome-specific edges did not 15 differ from that seen in random networks for any of the three subgenomes (P = 0.286 for LF-LF edges, 16 see Supplementary Figure S6), suggesting that the original dearth of such edges was a statistical artifact 17 resulting from the excess of triplet-to-triplet edges. 18 We also explored the association of between genes' role in metabolism and their pattern of post-19 hexaploidy evolution using the A. thaliana metabolic network (Methods). However, again considering the 20 state of each pillar at the end of the root branch, we did not find an excess of shared metabolic 21 interactions between triplicated or single-copy genes in this network (Supplementary Figure S6). Finally, we asked whether genes from the same subgenome are more likely to be co-expressed. 12 We constructed a B. rapa co-expression network from the RNASeq data described in the Methods 13 section. In this network edges connect pairs of genes that are highly correlated in their expression 14 (Methods). The inferred co-expression network contains 3,933 nodes (e.g., genes) from the LF 15 subgenome, 2,310 nodes from MF1 and 1,982 from MF2. We then counted the number of edges 16 connecting pairs of nodes from the same subgenome. To assess whether there was an excess of such 1 shared subgenome co-expression relationships, we randomly rewired the network 100 times and 2 compared the edge count distributions from these randomized networks to those of the real network. We 3 found that the real network did not show a significant excess of shared edges between genes from the 4 same subgenome when compared to the randomized networks (LF-LF P=0.36, MF1-MF1 P=0.82, MF2-5 MF2 P=0.08, Figure 4). show that, as had also been proposed, the least fractionated subgenome (e.g., the one with the most 20 retained genes) is very likely the genome that was added last. To these proposals, we add the new 21 observation that these hybridization events were likely not particularly closely spaced in time: our model 22 predicts that on the order of 1/3 of the gene losses from subgenomes MF1 and MF2 that occurred on the 23 root branch occurred before the arrival of the LF subgenome. In the future, it will be interesting to further 24 refine the timing of these events: the problem, however, is a challenging one because the allopolyploid 25 nature of the events means that molecular clock approaches will tend to estimate speciation times for the 1 allopolyploid ancestors rather than hybridization times. 2 Many forces shape genome evolution after polyploidy. A tendency for genes that operate in 3 multiunit complexes or involved in signaling cascades to remain duplicated post-polyploidy is best 4 explained by the presence of dosage constraints driven by a need to maintain the stoichiometry and Our results illustrate the importance of these dosage effects, with genes whose products interact with 9 many other gene products in A. thaliana being overly likely to be retained in triplicate in these Brassicae 10 genomes. Notably, this pattern is not observed for metabolic genes, a result we interpret as illustrating 11 metabolism's dynamic robustness to gene dosage changes (Kacser and Burns 1981). 12 We had previously argued that one force driving the biased fractionation that distinguishes the exceptions rather than the rule, meaning that pressure to maintain coadapted complexes is not a 18 significant driver of biases in fractionation. We found that although there was some degree of functional 19 distinction for single copy genes from the LF subgenome (e.g., enrichment in biological processes such as 20 DNA repair and RNA interference), more generally speaking, there was no significant evidence of 21 functionally incompatibility of single-copy genes from different subgenomes. Thus, genes from the same 22 subgenome were not more likely to interact with each other physically, nor were the genes returned to 23 single copy on the common root branch functionally subdivided among the subgenomes. And even the 24 DNA repair enzymes that rapidly returned to single-copy appear to derive from at least two of the three genes may be prone to dominant negative interactions may best explain their preference for a single-copy 1

state. 2
It remains to be seen if the "mix and match" pattern of subgenome retention observed here 3 represents the dominant mode of evolution for most allopolyploidies. Of course, whether or not 4 subgenome conflicts exist may be partly a question of the preexisting differences between the parental 5 species, and a more general survey of allopolyploidies that includes estimates of the subgenome 6 divergence prior to the polyploidy events would be most enlightening. If the pattern holds, however, the 7 implications could be significant: hybridization represents a potentially important means of adaption 8 and an E-value cutoff of 10 -10 (Campbell et al. 2014). If more than 30% of a given gene aligned to 4 transposases after the removal of low complexity regions, that gene was removed from the gene set.  Table S1 shows that POInT's model inferences are 1 consistent across a number of such estimated ancestral orders. Brassica paralogs from each of the three Brassica genomes and a single Arabidopsis gene from each of 13 the two Arabidopsis genomes, representing one "candidate gene" in the reconstructed ancestral genome. 14 Among these, 2,178 homology sets contained the maximum of eight genes (one each from the two 15 Arabidopsis genomes and three each from the two Brassica genomes). 16 The homology sets were used to retrieve the ancestral gene order from adjacency graph using an 17 efficient algorithm called Maximum Weight Matching (MWM) (Zheng et al. 2013). We identified all the 18 gene adjacencies in the four genomes, considering only the genes in the homology sets. Each adjacency 19 was then weighted according to how many of the 8 possible adjacencies were actually observed. The 20 MWM produced an optimal set of 10,944 linear contigs containing all 24,001 putative ancestral genes 21 from the homology sets that included 13,057 of 45,982 total adjacencies in the data set, with the 22 remaining adjacencies being inconsistent with this optimal set. We used the contigs in the output of the 23 MWM to reconstruct each of the 5 ancestral chromosomes. There were 34 contigs containing large 24 proportions of genes originating in two or more of the ancient chromosomes that were discarded, as were 25 any contigs containing four or fewer genes from a Brassica genome. While the 9,712 contigs so omitted 26 represent 89% of all contigs, they represent only 55% of the genes, leaving a small group of large contigs 1 with strong synteny relations in our ancestral reconstruction. We next identified adjacencies among the 2 contigs themselves and applied a second iteration of MWM on them, giving the optimal ordering of those 3 contigs. Combining these orders with the existing gene order information within each contig yields the 4 position of all the genes on each ancestral chromosome. This order was mapped to our set of pillars of 5 TCS, giving a subset of those pillars ordered by this ancestral order estimate. consisting of a "less fractionated" genome (LF), a subgenome with intermediate levels of gene loss (more 14 fractionated 1 or MF1) and an even more fractionated subgenome (MF2). We hence defined state S1 to 15 correspond to LF and S2 and S3 to MF1 and MF2, respectively. 16

17
The phylogenetic relationships of the triplicated members of the Brassicaceae 18 POInT fits these models to the pillar data under an assumed phylogenetic topology using 19 maximum likelihood, allowing us to use that likelihood statistic to compare different phylogenetic 20 relationships among these four hexaploid taxa. POInT's computational demands were too great to allow 21 testing all 15 rooted topologies of 4 species (POInT's models are not time reversible). However, by 22 making the reasonable assumption that B. rapa and B. oleracea are sister to each other, we were able to 23 test the three potential relationships of C. hispanica and S. alba to the two Brassicas. Figure 1 gives the 24 maximum likelihood topology: the two alterative topologies and their likelihoods are given in 25 Supplementary Figure S1. 26

Selective constraints of the retained triplets 1
We identified 218 pillars that retained triplicated genes across all four genomes and where the 2 confidence in their subgenome assignments was  95%. For each such pillar, the 12 sequences were 3 aligned using T-coffee (Notredame et al. 2000). The cladogram for each 12 genes consists of three 4 subtrees grouping four sequences that belong to same subgenome in the same sister group 5 (Supplementary Figure S4). Then, using codeml in PAML (Yang 2007) with CodonFreq set to F3X4, we 6 inferred three distinct dN/dS ratios: one for each of the three subtrees deriving from the three parental 7 subgenomes. 8 9

Functional analysis of single-copy genes from different subgenomes 10
We performed functional analysis for genes where we have high (  95%) confidence that they 11 returned to single copy along the common root branch. Using the corresponding "ancestral" locus from A. subgenome are over-connected in this network, we permutated the subgenome assignments 1,000 times, 1 holding the network topology unchanged. We then compared the actual number of edges connecting 2 single copy genes from the same subgenome with the distribution of this value seen in the randomized 3 networks (Supplementary Figure S6). We also asked whether the ancestral genes corresponding to 4 retained triplets showed an excess of connections amongst themselves. Because the number of edges 5 between retained triplets and between single-copy genes are not independent, we performed an additional 6 set of permutations, in which we held all the triplet rows constant and only shuffled the subgenome 7 assignments of the remaining nodes.  We filtered the dataset to only include genes that were missing a measured expression value for at The assembly of the Crambe hispanica v1.1 genome will be available under NCBI BioProject 5 PRJNA631330. The annotated Crambe hispanica v1.1 genome will also be available to download from 6