Lawrence Berkeley National Laboratory
VecFinder: Automated de novo identification and removal of vector and adapter sequence from genomic datasets
Author(s):
- Zhang, Michael Y.
- Tu, Hank
- Shapiro, Harris
- Platt, Darren
- et al.
High-throughput Sanger sequencing requires DNA to be inserted into bacterial vectors for biological amplification. Adapter or linker oligonucleotides may also be attached to target DNA fragments to facilitate insertion into the vector. These vector and adapter sequences are sequenced along with the target, or insert, sequence and represent contamination that must be removed from the dataset prior to analysis. Such contamination can be removed by screening the dataset against the known sequences of the vector and adapter used to generate the data. However, for public or collaborator datasets, information about these contaminant sequences is often incorrect or absent, resulting in incomplete screening.

We have created VecFinder, a software tool that identifies the vector and adapter sequences from the read sequences alone and subsequently removes them. This removes the dependence on library creators to provide the vector and adapter sequences used for the library. It also automates the previously manual task of identifying and screening the adapter or linker sequence, which can be tedious and time-consuming.
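To make the idea of identifying a contaminant "from the read sequences alone" concrete, the following minimal sketch infers a shared leading sequence by per-position majority vote and then trims it from the reads. It is purely illustrative and is not VecFinder's algorithm, which the abstract does not describe. The function names, the thresholds (`min_fraction`, `min_reads`, `max_mismatches`), and the toy reads are all assumptions made for this example, and it further assumes that the contaminant starts at position 0 of every read.

```python
from collections import Counter

def infer_leading_contaminant(reads, min_fraction=0.8, min_reads=3):
    """Hypothetical heuristic: extend a consensus prefix base by base
    while the most common base at that position is shared by at least
    min_fraction of the reads that are long enough to cover it."""
    consensus = []
    position = 0
    while True:
        bases = [r[position] for r in reads if len(r) > position]
        if len(bases) < min_reads:
            break  # too few reads left to support a call
        base, count = Counter(bases).most_common(1)[0]
        if count / len(bases) < min_fraction:
            break  # reads diverge here: likely the start of the insert
        consensus.append(base)
        position += 1
    return "".join(consensus)

def trim_prefix(read, contaminant, max_mismatches=1):
    """Remove the contaminant from the start of a read if the read's
    prefix matches it with at most max_mismatches substitutions."""
    n = len(contaminant)
    if n == 0 or len(read) < n:
        return read
    mismatches = sum(1 for a, b in zip(read, contaminant) if a != b)
    return read[n:] if mismatches <= max_mismatches else read

# Toy reads (invented for illustration): a shared 11-base leading
# sequence followed by unrelated insert sequence. The fourth read
# carries one substitution inside the shared prefix.
reads = [
    "GATCCGAATTCACGTTGCA",
    "GATCCGAATTCTTGACCGT",
    "GATCCGAATTCGGCATAAC",
    "GATCCGTATTCCATGGATC",
    "GATCCGAATTCAGGCTTAG",
]

contaminant = infer_leading_contaminant(reads)
cleaned = [trim_prefix(r, contaminant) for r in reads]
```

A real screen would also need to handle contaminants that appear at the 3' end or at variable offsets, insertions and deletions, and low-quality base calls; alignment-based approaches are the usual way to address those cases.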