The rapid generation of biological sequences, such as nucleotide and amino acid sequences, has revolutionized research in molecular biology. To name a few applications, DNA sequences generated by RNA-Seq technology facilitate gene expression analysis, and protein sequences, which represent the primary structure of proteins, can be used to predict protein-protein interactions. Moreover, the vast amount of sequence data generated by high-throughput technologies scales data analysis up to the omics level. Consequently, developing novel computational methods and tailoring existing algorithms are imperative for extracting relevant and critical knowledge from sequence data.
In this dissertation, we introduce several computational frameworks that leverage genomic sequences to quantify gene expression and proteomic sequences to characterize protein-protein interactions. The methodologies presented in these frameworks span several research areas, including feature extraction from string data, string matching for DNA sequences, statistical inference for expression quantification, and sequence-pair modeling through deep learning. As a result, these approaches not only tackle specific challenges in the applications mentioned above but also have the potential to address issues in other sequencing applications.