In this thesis we present parallel algorithms and implementations of a bzip2-like lossless data compression scheme for GPU architectures. Our approach parallelizes three main stages in the bzip2 compression/decompression pipeline: Burrows-Wheeler transform (BWT), move-to-front transform (MTF), and Huffman coding. In particular, we utilize a two-level hierarchical sort for the forward-BWT, invent a novel parallel reverse-BWT algorithm, design a novel scan-based parallel MTF and reverse-MTF algorithm, implement parallel Huffman encoding/decoding, and implement a parallel reduction scheme to build the Huffman tree. For each algorithm, we perform detailed performance analysis, discuss its strengths and weaknesses, and suggest future directions for improvements. Overall, our GPU implementation of the compression pipeline is dominated by BWT performance and is 2.78x slower than bzip2, with BWT and MTF+Huffman respectively 2.89x and 1.34x slower on average. Our overall GPU decompression runtime is distributed more evenly and is 1.2x faster, with reverse-BWT 2.45x faster and reverse- MTF+Huffman 1.83x slower than bzip2.