The high demand for data storage and data communication brings many new challenges and concerns, including but not limited to, how to efficiently store data in devices and how to reliably communicate data between different devices to avoid possible errors. Those two problems correspond to the two most important concepts in information theory: source coding and channel coding, where the former compresses data to reduce its size, and the latter enlarges the data to combat errors.In this thesis, four problems about source coding and channel coding are presented with applications in information storage and networks.
For source coding, SEAL is presented, which is a novel data compression approach for system log data and supports causality analysis for system security. Based on information-theoretic observations on system event data, the approach achieves lossless compression and supports near real-time retrieval of historic events. In the compression step, the causality graph induced by the system logs is investigated, and abundant edge reduction potentials are explored. In the query step, for maximal speed, decompression is opportunistically executed. SEAL greatly alleviates the exponentially-growing storage demand for causality analysis in industrial computer security.
For channel coding, an error-correcting code based on low-density parity-check code(LDPC) for DNA storage is first introduced. In this work, DNA storage that uses affordable and portable nanopore sequencing is considered as the reading mechanics. Unlike traditional data storage systems, errors occur asymmetrically among the four types of nucleotide bases of DNA. Quaternary codes can be employed for error correction, but suffer from high complexity. For this problem, a turbo-like decoder for the DNA storage channel is designed. Meanwhile, the corresponding density evolution algorithms are designed to prove the optimal bound for the decoders. Simulation results show that the binary LDPC codes have a similar bit error rate but with a speedup by a factor of 4 compared to quaternary codes.
Next, a source of synchronization errors in DNA storage is identified, which could result in missing symbols in the read result from nanopore sequencing. To limit such errors, constrained codes are presented. Moreover, this work shows encoding algorithms and maximum a posteriori decoding algorithms in the presence of additive Gaussian noise and deletions. The simulation shows a trade-off between the coding rate and the missing-symbol rate. The decoding simulation results show that the algorithms can correct missing-symbols error efficiently. Meanwhile,simulated data from DeepSimulator by Y. Li et al. [1] and the real-world data also show that the constrained DNA strings
have fewer missing-symbol errors.
Finally, coding for network synchronization is investigated. This work studies the problem of clock synchronization in an arbitrary network, viewed as a graph. Each server is a node, and two nodes are connected by an edge if their time discrepancy (edge weight) is measured. The time discrepancy is known by one or both of these two servers. The goal is to reduce the communication cost between server nodes and the master node, which collects the discrepancy information to eliminate the loop-wise offset surplus in the network. For the two important cases where each time discrepancy is known by both adjacent servers and by only one of the adjacent servers, the optimal schemes are found.
1. Li, Yu, et al. "DeepSimulator1. 5: a more powerful, quicker, and lighter simulator for Nanopore sequencing." Bioinformatics 36.8 (2020): 2578-2580.