Graph-based Geospatial Prediction and Clustering for Situation Recognition
Big data continues to grow and diversify at an increasing pace. To understand constantly evolving situations, data is collected both from location-based sensors and from people through participatory sensing. Static sensors are placed at fixed locations, where they monitor and measure important environmental variables, while people contribute data in the form of mobile streams. To process such disparate data for situation recognition, we must address two major research challenges: data quality and data analysis.
This dissertation presents graph-based models as a general framework for combining diverse data sources for situation recognition. To improve spatial data quality, we propose a multimodal geospatial prediction model that integrates heterogeneous data into a unified space-time-value format, extracts multi-resolution spectral features, and predicts values at unobserved locations. We demonstrate the effectiveness of the proposed model using air pollution data. We then combine the predicted air pollution data with pollen data to analyze asthma risk across California. The resulting analysis, based on our predicted values, detects high asthma risk areas more accurately than previous studies.
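To make the idea of predicting values at unobserved locations on a graph concrete, the following is a minimal sketch only. It is not the multimodal model proposed here: it swaps in plain harmonic (Laplacian-based) interpolation on a Gaussian similarity graph, and the function names, the `sigma` bandwidth, and the toy coordinates are all assumptions made for illustration.

```python
import numpy as np

def gaussian_graph(coords, sigma=1.0):
    """Dense similarity graph: W[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    diff = coords[:, None, :] - coords[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def predict_unobserved(W, values, observed_mask):
    """Harmonic interpolation: solve L_uu f_u = -L_uo f_o for unobserved nodes.

    Assumes every unobserved node is connected (possibly indirectly) to at
    least one observed node, so that L_uu is invertible.
    """
    L = np.diag(W.sum(axis=1)) - W          # combinatorial graph Laplacian
    u = ~observed_mask
    o = observed_mask
    L_uu = L[np.ix_(u, u)]
    L_uo = L[np.ix_(u, o)]
    f = values.astype(float).copy()
    f[u] = np.linalg.solve(L_uu, -L_uo @ f[o])
    return f

# Hypothetical toy data: three sensor sites on a line, the middle one unobserved.
coords = np.array([[0.0], [1.0], [2.0]])
values = np.array([10.0, 0.0, 20.0])        # the 0.0 is only a placeholder
observed = np.array([True, False, True])

W = gaussian_graph(coords, sigma=1.0)
print(predict_unobserved(W, values, observed))
```

Each predicted value is a similarity-weighted average of its neighbours, so predictions stay within the range of the observed measurements. A richer model would replace the raw coordinates with learned features.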
Given the wide use of mobile technology, humans also act as sensors and increasingly contribute photos as mobile data streams. We investigate the use of visual concepts for situation recognition and develop new models to tackle two technical challenges that existing approaches handle poorly: noisy data and real-time processing. First, we propose a graph-regularized linear regression PCA to address photo clustering in a real-time setting. Second, because noisy data degrades graph quality and leads to poor recognition results, we incorporate a capped norm into a graph embedding method to remove the adverse effects of outliers. Our models not only outperform existing approaches in handling noisy data and real-time processing, but also allow us to utilize a new cluster of data through a soft label method. Finally, we apply the proposed approaches to the Yahoo Flickr 100 Million photos to detect evolving situations.
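As a brief illustration of why a capped norm limits the influence of outliers, the sketch below computes a capped L2 loss, min(||r_i||, eps), over a set of residual vectors. It shows only the capping idea, not the graph embedding method itself; the function names, the threshold `eps`, and the toy residuals are assumptions chosen for the example.

```python
import numpy as np

def capped_l2(residuals, eps):
    """Per-sample capped L2 norm: min(||r_i||_2, eps)."""
    norms = np.linalg.norm(residuals, axis=1)
    return np.minimum(norms, eps)

def inlier_mask(residuals, eps):
    """Samples whose residual norm lies below the cap still affect the fit."""
    return np.linalg.norm(residuals, axis=1) < eps

# Hypothetical residuals: two ordinary samples and one gross outlier.
residuals = np.array([[3.0, 4.0],
                      [0.0, 1.0],
                      [60.0, 80.0]])

plain_loss = np.linalg.norm(residuals, axis=1).sum()
capped_loss = capped_l2(residuals, eps=10.0).sum()

print("plain loss: ", plain_loss)
print("capped loss:", capped_loss)
print("inliers:    ", inlier_mask(residuals, eps=10.0))
```

Under the plain norm the outlier dominates the total loss, whereas under the cap its contribution is bounded by `eps`. Once a sample exceeds the cap, moving it further away no longer changes the objective.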