The complex sequence landscape of maize revealed by single molecule technologies

Complete and accurate reference genomes and annotations provide fundamental tools for characterization of genetic and functional variation. These resources facilitate elucidation of biological processes and support translation of research findings into improved and sustainable agricultural technologies. Many reference genomes for crop plants have been generated over the past decade, but these genomes are often fragmented and missing complex repeat regions. Here, we report the assembly and annotation of maize, a genetic and agricultural model crop, using Single Molecule Real-Time (SMRT) sequencing and high-resolution genome map. Relative to the previous reference genome, our assembly features a 52-fold increase in contig length and significant improvements in the assembly of intergenic spaces and centromeres. Characterization of the repetitive portion of the genome revealed over 130,000 intact transposable elements (TEs), allowing us to identify TE lineage expansions unique to maize. Gene annotations were updated using 111,000 full-length transcripts obtained by SMRT sequencing. In addition, comparative optical mapping of two other inbreds revealed a prevalence of deletions in the region of low gene density region and maize lineage-specific genes.

Maize is the most productive and widely grown crop in the world, as well as a foundational model for genetics and genomics 1 . An accurate genome assembly for maize is critical for all forms of basic and applied research, which will enable increases in yield to feed the growing world population. The current assembly of the maize genome, based on Sanger sequencing, was first published in 2009 2 . Although this initial reference enabled rapid progress in maize genomics 3 , the original assembly is composed of more than a hundred thousand small contigs, many of which are arbitrarily ordered and oriented, significantly complicating detailed analysis of individual loci 4 and impeding investigation of intergenic regions crucial to our understanding of phenotypic variation 5,6 and genome evolution 7,8 .
Here we report a vastly improved de novo assembly and annotation of the maize reference genome (Figure 1). Based on 65X Single Molecule Real-Time sequencing we assembled the genome of the maize inbred B73 into 2,958 contigs, where half of the total assembly is made up of contigs larger than 1.2 Mb (Table 1, Extended Data Figure 1, 2a).
The assembly of the long reads was then integrated with a high quality optical genome map (Extended Data Figure 1, Extended Data Table 1) to create a hybrid assembly consisting of 625 scaffolds (Table 1). To build chromosome-level super-scaffolds, we combined the hybrid assembly with a minimum tiling path generated from Sanger sequence of bacterial artificial chromosomes (BACs) 9 and a high-density genetic map 10 (Extended Data Figure 2b). After gap-filling and error correction using short sequence reads, the total size of maize B73 RefGen_v4 pseudomolecules was 2,106 Mb. The new reference assembly has 2,522 gaps, the genome maps covering approximately half (1,115) of these allowed us to estimate their mean length to be (27 kb) (Extended Data Figure 2c).
Comparison of the new assembly to the previous BAC-based Sanger assembly revealed >99.9% sequence identity and a 52-fold increase in the mean contig length, with 84% of the BACs spanned by a single contig from the long reads assembly. Alignment of ChIP-seq data for centromere-specific histone H3 (CENH3) 11 revealed that centromeres are accurately placed and largely intact. A number of previously identified 12 megabasesized mis-oriented pericentromeric regions were also corrected (Extended Data Figure   3a Our assembly made substantial improvements in the gene space including resolution of gaps and misassemblies, correct order and orientation. We also updated the annotation of our new assembly, resulting in consolidation of gene models (Extended Data Figure 4a). Published full-length cDNA data 13 improved the annotation of alternative splicing by more than doubling the number of alternative transcripts from 1.6 to 3.3 per gene (Extended Data Figure 5a) with about 70% of genes with support from the full-length transcripts. Our reference assembly also vastly improved the coverage of regulatory sequences, decreasing the number of genes exhibiting gaps in the 3 kb region(s) flanking coding sequence from 20% to <1% (Extended Data Figure 4b, Extended Data 5b). The more complete sequence enabled significant improvements in the annotation of core promoter elements, especially the TATA-box, CCAAT-box, and Y patch motifs (Supplementary information). Quantitative genetic analyses have shown that the genetic variation in regulatory regions explain a large number of phenotypes 5,6 , suggesting that the new reference will dramatically improve our ability to identify and predict functional genetic variation.
After its divergence from Sorghum, the maize lineage underwent genome doubling followed by diploidization and gene loss. Previous work showed that gene loss had bias toward one of the parental genomes 14 , but our new assembly and annotation paint a more dramatic picture, revealing that 56% of syntenic sorghum orthologs map uniquely to the dominant maize subgenome (designated A, total size 1.16 Gb), whereas only 24% map uniquely to subgenome B (total size, 0.63Gb). Gene loss in maize has primarily been considered in the context of polyploidy and functional redundancy 14,15 , but we found that despite its polyploidy, maize has lost a larger proportion (14%) of the 22,048 ancestral gene orthologs than any of the other four grass species evaluated to date (Sorghum, rice, Brachypodium distachyon, and Setaria italica, Extended Data Figure 6).
Nearly one-third of these losses are specific to maize, and analysis of a restricted highconfidence set revealed enrichment for genes involved in biotic and abiotic stresses (Extended Data Table 2), e.g., NB-ARC domain disease resistance genes and the serpin protease inhibitor involved in pathogen defense and programmed cell death 16,17 .
Transposable elements (TEs) were first reported in maize 18 and have since been shown to play important roles in shaping genome evolution and gene regulatory networks of many species 19 . The majority of the maize genome is derived from TEs 2,20 , and careful study of a few regions has revealed a characteristic structure of sequentially nested retrotransposons 20,21 and the effect of deletions and recombination on retrotransposon evolution 22 . In the annotation of the original maize assembly, however, fewer than 1% of LTR retrotransposon copies were intact 23 . By applying a novel homology-independent annotation pipeline to our assembly (Extended Data Table 3 Table 2). The average indel size was ~20kb, with a range from 100 bp (the smallest detectable) to over 1 Mb ( Figure 3B). More than 90% of the indels were unique to one inbred or the other, indicating a high level of structural diversity in maize. As short-read sequence data are available from both Ki11 and W22 8 , we analyzed 1,451 of the largest (<10kb) deletions and found that 1,083 were supported by a clear reduction in read depth ( Figure 3C). The confirmed deletions occurred in regions of low gene density Although maize is often considered to be a large-genome crop, most major food crops have even larger genomes with more complex repeat landscapes 30

Genome assembly
De novo assembly of the genome sequencing data: De novo assembly of the long reads from SMRT Sequencing was performed using two assemblers: the Celera Assembler PBcR -MHAP pipeline 31 and Falcon 32 with different parameter settings. Quiver from SMRT Analysis v2.3.0 was used to polish base calling of contigs. The three independent assemblies were evaluated by aligning with the genome maps.
Contamination of contigs by bacterial and plasmid genomes was eliminated using the NCBI GenBank submission system 33 . Curation of the assembly, including resolution of conflicts between the contigs and the genome map and removal of redundancy at the edges of contigs, is described in the supplemental material.

Hybrid scaffold construction:
To create hybrid scaffolds, conflict-resolved sequence contigs and genome maps were aligned and merged with RefAligner using a threshold P value of 1 × 10 −11 34 . To maximize the sequence content in the hybrid scaffolds, all (scaffolded and unscaffolded) sequence contigs were aligned to the hybrid scaffolds using a less stringent threshold P value (1× 10 −8 ). The unscaffolded sequence contigs that aligned to the hybrid scaffold and maps at least 50% of their length and that did not overlap with already scaffolded sequence contigs were added to the hybrid scaffolds.
Pseudomolecule construction: Sequences from BACs on the physical map that were used to build the maize V3 pseudomolecules were aligned to contigs using MUMMER package 35 with the following parameter settings: "-l(minimum length of a single match) 100 -c(the minimum length of a cluster of matches) 1000". Next, to make sure only the unique hits were used, the alignment hits were filtered with the following parameters: "i(the minimum alignment identity ) 98 -l(the minimum alignment length ) 10000". The scaffolds were then ordered and oriented into pseudochromosomes using the order of BACs as a guide. For quality control, we mapped the SNP markers from a genetic map built from an intermated maize recombinant inbred line population (Mo17 × B73) 10 .
Contigs with markers not located in pseudochromosomes from the physical map were placed into the AGP (A Golden Path) using the genetic map.
Further polishing of pseudomolecules: A round of gap filling was applied to the raw pseudochromosomes using Pbjelly (--maxTrim=0, --minReads=2). The pseudomolecules were then polished again using the Quiver pipeline from SMRT Analysis v2.3.0. To increase the accuracy of the base calls, we performed two lanes of sequencing on each DNA sample (library size = 450bp) using Illumina 2500 Rapid run, which generated about 100-fold 250PE data of Maize B73. Reads were aligned to the assembly using BWA-mem 36 . To correct base calling in the assembly, SAMtools 37 was used to generate the BAM format alignment for the Pilon pipeline 38 , using reads with sequencing and alignment quality above 20.

Annotation
To generate a comprehensive annotation of transposable elements, we generated a The MAKER-P pipeline was used to annotate protein-coding genes 45 . Evidence included publicly available full-length cDNA 46 , de novo assembled transcripts from short read mRNA-seq 47 , Iso-Seq full-length transcripts 13 , and proteins from other species. The gene models were filtered to remove transposons and low-confidence predictions. Additional alternative transcript isoforms were obtained from the Iso-Seq data.

Structural Variation
Leaves were used to prepare high molecular weight DNA and genome maps were constructed as described above for B73. Structural variant calls were generated based on alignment to the reference map B73 V4 chromosomal assembly using the Multiple Local Alignment algorithm (RefSplit) 34 . A structural variant was identified as an alignment outlier 34,48 , defined as two well-aligned regions separated by a poorly aligned region with a large size difference between the reference genome and the map or by one or more unaligned sites, or alternately as a gap between two local alignments. A confidence score was generated by comparing the non-normalized p-values of the two well-aligned regions and the non-normalized log-likelihood ratio 49 of the unaligned or poorly aligned region.
With a confidence score threshold of 3, RefSplit is sensitive to insertions and deletions as small as 700 bp and other changes such as inversions and complex events which could be balanced. Insertion and deletion calls were based on an alignment outlier p-value threshold of 1× 10 −4 . Insertions or deletions that crossed gaps in the B73 pseudomolecules, or that were heterozygous in the genome maps, were excluded.
Considering the resolution of the genome map, only insertion and deletions larger than 100 bp were used for subsequent analyses. To obtain high-confidence deletion sequences, sequencing reads from the maize HapMap2 project 8 for Ki11 and W22 were aligned to our new B73 v4 reference genome using Bowtie2 50 . Read depth was calculated from reads with mapping quality > 20 in 10-kb windows with step size of 1 kb. Windows where read depth dropped below 10 in Ki11 and 20 in W22 (short reads sequencing depths in Ki11 and W22 are respectively 2.32X and 4.04X) in the deleted region were retained for further analysis.   There are totally 2,522b gaps in the pseudomolecules. The top row of orange rectangles are the 1,115 gaps with size estimation from the genome map, the purple rectangles represent gaps without size estimation. The light gray rectangles in second row correspond to the contigs larger than 1Mb and the dark gray rectangles are the contigs smaller than 1Mb. Over half of the genome is represented by the contigs above 1Mb. Figure 2. Phylogeny of maize and sorghum LTR retrotransposon families. Both A) Ty3/Gypsy and B) Ty1/Copia superfamilies are present at higher copy number in maize (red) than in sorghum (blue). Bars (log 10 -scaled) depict family copy numbers.