eScholarship
Open Access Publications from the University of California

UC Irvine

UC Irvine Electronic Theses and Dissertations

Bias Mitigation in Galaxy Zoo Using Machine Learning Techniques

  • Author(s): Silva do Nascimento Neto, Pedro
  • Advisor(s): Hayes, Wayne B
  • et al.
Creative Commons 'BY' version 4.0 license
Abstract

Automated analysis of galaxy structure using machine learning had been attempted several times, but it never performed well enough to be used reliably. In response, the Galaxy Zoo initiative, a project in which anyone could assist in the morphological classification of galaxies, came to life in 2007. The project was a success, and it has since inspired several other citizen science projects, not only in astronomy but also in other fields. It produced one of the most extensive labeled datasets of galaxy morphology, a rare asset in a field that has an abundance of data, most of it unlabeled. This dataset of almost 1 million Sloan galaxies is still used today, especially for training machine learning classification models. A major concern is that the dataset contains known and measured human biases, some of which were never corrected.

In this dissertation, we explain how we trained a machine learning model that effectively removes some chirality biases using data from SpArcFiRe (a program designed to isolate and quantify arm structure in spiral galaxies), photometric data provided by the Sloan Digital Sky Survey, and labels from Galaxy Zoo. We use the model to detect and correct a chirality bias that is present when selecting a sample of spiral galaxies at any spirality threshold. We also employ a variant of this model to help detect a selection bias caused by the reduced visibility of tightly wound arms in distant spiral galaxies.

Finally, we introduce a novel machine learning approach that combines feature vectors to perform classification by solving a linear system of the form A*x = b. We achieve accuracies over 90% on both the Wine and Iris datasets. Currently, however, the method scales poorly and is too slow compared with similar methods. We suggest possible directions for increasing its speed without compromising accuracy.
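
To make the idea of classifying by solving A*x = b concrete, the sketch below shows a generic one-vs-rest least-squares classifier. It is not the method developed in the dissertation, whose way of combining feature vectors is not described in this abstract. All function names, the ridge term, and the toy data are assumptions made for illustration only. Each row of A is a feature vector with a constant 1 appended, b holds 0/1 class indicators, and x is obtained from the normal equations.

```python
# Illustrative sketch only: a generic least-squares classifier, NOT the
# dissertation's method. Names, ridge value, and toy data are assumptions.

def solve_linear_system(A, b):
    """Solve a square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_least_squares(features, targets, ridge=1e-6):
    """Solve the normal equations (A^T A + ridge*I) x = A^T b for one class."""
    A = [row + [1.0] for row in features]  # append a bias column
    d = len(A[0])
    AtA = [[sum(r[i] * r[j] for r in A) + (ridge if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    Atb = [sum(r[i] * t for r, t in zip(A, targets)) for i in range(d)]
    return solve_linear_system(AtA, Atb)


def predict(weights_per_class, sample):
    """Return the index of the class whose linear score is largest."""
    row = sample + [1.0]
    scores = [sum(w * v for w, v in zip(weights, row))
              for weights in weights_per_class]
    return scores.index(max(scores))


# Toy data (made up): two well-separated clusters in two dimensions.
features = [[1.0, 1.2], [1.5, 0.8], [0.9, 1.6],
            [5.0, 5.1], [4.6, 5.4], [5.3, 4.7]]
labels = [0, 0, 0, 1, 1, 1]

# One linear system per class, with 0/1 indicator targets as the vector b.
weights_per_class = [
    fit_least_squares(features, [1.0 if y == k else 0.0 for y in labels])
    for k in sorted(set(labels))
]

print(predict(weights_per_class, [1.2, 1.0]))
print(predict(weights_per_class, [5.1, 4.9]))
```

Forming and solving the normal equations costs roughly O(n*d^2 + d^3) for n samples and d features, which gives a sense of why linear-system approaches can become slow as the problem grows.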
