Restructuring DNS Log Data for Faster Querying and Optimizing Resource Usage on a Hadoop Cluster
eScholarship
Open Access Publications from the University of California

Abstract

Log data from DNS resolvers contain rich information that is useful for research use cases such as estimating the popularity of websites. Log data from approximately 39k resolvers has been collected and stored on HDFS. The data set is so large, and so poorly structured for querying, that searching the log for records of interest takes a great deal of time and resources. In this project, we investigate techniques for porting the log data to a new format that reduces query time and requires fewer resources both to store and to query the data. We investigated bzip compression, reformatting and pruning of unessential records, and partitioning of the records into separate buckets. In our experiments, combining reformatting and pruning with partitioning and efficient sorting of the records on multiple fields made domain queries 6 times faster and used approximately 8 times fewer resources per query than the unstructured data. The restructured data also takes about 9 times less space on disk.
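The general idea of pruning, bucketing, and multi-field sorting can be illustrated with a small in-memory sketch. This is not the authors' Hadoop implementation: the record fields (`domain`, `timestamp`, `resolver`), the bucket count, and the CRC32-based bucket assignment are all assumptions chosen for illustration, and plain Python lists stand in for files on HDFS.

```python
import bisect
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for this sketch


def bucket_of(domain: str) -> int:
    """Assign a domain to a bucket with a stable hash (assumed scheme)."""
    return zlib.crc32(domain.encode("utf-8")) % NUM_BUCKETS


def restructure(raw_records):
    """Prune, partition, and sort records.

    raw_records is an iterable of dicts with hypothetical keys
    'domain', 'timestamp', and 'resolver'; any other keys are dropped.
    """
    buckets = defaultdict(list)
    for rec in raw_records:
        if not rec.get("domain"):
            continue  # prune records that carry no queried domain
        row = (rec["domain"].lower(), rec["timestamp"], rec["resolver"])
        buckets[bucket_of(row[0])].append(row)
    for rows in buckets.values():
        rows.sort()  # multi-field sort: domain, then timestamp, then resolver
    return dict(buckets)


def query_domain(buckets, domain):
    """Return all rows for one domain by scanning only its bucket."""
    domain = domain.lower()
    rows = buckets.get(bucket_of(domain), [])
    # Binary search for the contiguous run of rows with this domain.
    lo = bisect.bisect_left(rows, (domain,))
    hi = bisect.bisect_left(rows, (domain + "\x00",))
    return rows[lo:hi]
```

A domain lookup in this sketch touches a single bucket and locates the matching rows by binary search, rather than scanning every record. That is the kind of saving that partitioning and sorting aim for, although the speedups reported in the abstract come from the authors' own experiments on the cluster, not from this sketch.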

Pre-2018 CSE ID: CS2013-0997
