We present parallel algorithms and implementations of a bzip2-like lossless data compression scheme for GPU architectures. Our approach parallelizes three main stages in the bzip2 compression pipeline: Burrows-Wheeler transform (BWT), move-to-front transform (MTF), and Huffman coding. In particular, we utilize a two-level hierarchical sort for BWT, design a novel scan-based parallel MTF algorithm, and implement a parallel reduction scheme to build the Huffman tree. For each algorithm, we perform detailed performance analysis, discuss its strengths and weaknesses, and suggest future directions for improvements. Overall, our GPU implementation is dominated by BWT performance and is 2.78x slower than bzip2, with BWT and MTF-Huffman respectively 2.89x and 1.34x slower on average. © 2012 IEEE.