eScholarship
Open Access Publications from the University of California

School of Information

Open Access Policy Deposits

This series is automatically populated with publications deposited by UC Berkeley School of Information researchers in accordance with the University of California’s open access policies. For more information see Open Access Policy Deposits and the UC Publication Management System.

Microestimates of wealth for all low- and middle-income countries.

(2022)

Many critical policy decisions, from strategic investments to the allocation of humanitarian aid, rely on data about the geographic distribution of wealth and poverty. Yet many poverty maps are out of date or exist only at very coarse levels of granularity. Here we develop microestimates of the relative wealth and poverty of the populated surface of all 135 low- and middle-income countries (LMICs) at 2.4 km resolution. The estimates are built by applying machine-learning algorithms to vast and heterogeneous data from satellites, mobile phone networks, and topographic maps, as well as aggregated and deidentified connectivity data from Facebook. We train and calibrate the estimates using nationally representative household survey data from 56 LMICs and then validate their accuracy using four independent sources of household survey data from 18 countries. We also provide confidence intervals for each microestimate to facilitate responsible downstream use. These estimates are provided free for public use in the hope that they enable targeted policy response to the COVID-19 pandemic, provide the foundation for insights into the causes and consequences of economic development and growth, and promote responsible policymaking in support of sustainable development.

Public mobility data enables COVID-19 forecasting and management at local and global scales.

(2021)

Policymakers everywhere are working to determine the set of restrictions that will effectively contain the spread of COVID-19 without excessively stifling economic activity. We show that publicly available data on human mobility, collected by Google, Facebook, and other providers, can be used to evaluate the effectiveness of non-pharmaceutical interventions (NPIs) and forecast the spread of COVID-19. This approach uses simple and transparent statistical models to estimate the effect of NPIs on mobility, and basic machine learning methods to generate 10-day forecasts of COVID-19 cases. An advantage of the approach is that it involves minimal assumptions about disease dynamics and requires only publicly available data. We evaluate this approach using local and regional data from China, France, Italy, South Korea, and the United States, as well as national data from 80 countries around the world. We find that NPIs are associated with significant reductions in human mobility, and that changes in mobility can be used to forecast COVID-19 infections.

Reconfiguring Diversity and Inclusion for AI Ethics

(2021)

Activists, journalists, and scholars have long raised critical questions about the relationship between diversity, representation, and structural exclusions in data-intensive tools and services. We build on work mapping the emergent landscape of corporate AI ethics to center one outcome of these conversations: the incorporation of diversity and inclusion in corporate AI ethics activities. Using interpretive document analysis and analytic tools from the values in design field, we examine how diversity and inclusion work is articulated in public-facing AI ethics documentation produced by three companies that create application and services layer AI infrastructure: Google, Microsoft, and Salesforce. We find that as these documents make diversity and inclusion more tractable to engineers and technical clients, they reveal a drift away from civil rights justifications that resonates with the managerialization of diversity by corporations in the mid-1980s. The focus on technical artifacts, such as diverse and inclusive datasets, and the replacement of equity with fairness make ethical work more actionable for everyday practitioners. Yet, they appear divorced from broader DEI initiatives and other subject matter experts that could provide needed context to nuanced decisions around how to operationalize these values. Finally, diversity and inclusion, as configured by engineering logic, positions firms not as ethics owners but as ethics allocators; while these companies claim expertise on AI ethics, the responsibility of defining who diversity and inclusion are meant to protect and where it is relevant is pushed downstream to their customers.

Micro-Estimates of Wealth for all Low- and Middle-Income Countries

(2021)

Many critical policy decisions, from strategic investments to the allocation of humanitarian aid, rely on data about the geographic distribution of wealth and poverty. Yet many poverty maps are out of date or exist only at very coarse levels of granularity. Here we develop the first micro-estimates of wealth and poverty that cover the populated surface of all 135 low- and middle-income countries (LMICs) at 2.4 km resolution. The estimates are built by applying machine learning algorithms to vast and heterogeneous data from satellites, mobile phone networks, and topographic maps, as well as aggregated and de-identified connectivity data from Facebook. We train and calibrate the estimates using nationally representative household survey data from 56 LMICs, then validate their accuracy using four independent sources of household survey data from 18 countries. We also provide confidence intervals for each micro-estimate to facilitate responsible downstream use. These estimates are provided free for public use in the hope that they enable targeted policy response to the COVID-19 pandemic, provide the foundation for new insights into the causes and consequences of economic development and growth, and promote responsible policymaking in support of the Sustainable Development Goals.

Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities

(2021)

Machine learning (ML) is now commonplace, powering data-driven applications in various organizations. Unlike the traditional perception of ML in research, ML production pipelines are complex, with many interlocking analytical components beyond training, whose sub-parts are often run multiple times on overlapping subsets of data. However, there is a lack of quantitative evidence regarding the lifespan, architecture, frequency, and complexity of these pipelines to understand how data management research can be used to make them more efficient, effective, robust, and reproducible. To that end, we analyze the provenance graphs of 3000 production ML pipelines at Google, comprising over 450,000 models trained, spanning a period of over four months, in an effort to understand the complexity and challenges underlying production ML. Our analysis reveals the characteristics, components, and topologies of typical industry-strength ML pipelines at various granularities. Along the way, we introduce a specialized data model for representing and reasoning about repeatedly run components in these ML pipelines, which we call model graphlets. We identify several rich opportunities for optimization, leveraging traditional data management ideas. We show how targeting even one of these opportunities, i.e., identifying and pruning wasted computation that does not translate to model deployment, can reduce wasted computation cost by 50% without compromising the model deployment cadence.

Gamblers Learn from Experience

(2020)

Mobile phone-based gambling has grown wildly popular in Africa. Commentators worry that low-ability gamblers will not learn from experience, and may rely on debt to gamble. Using data on financial transactions for over 50 000 Kenyan smartphone users, we find that gamblers do learn from experience. Gamblers are less likely to bet following poor results and more likely to bet following good results. The reaction to positive and negative feedback is of equal magnitude, and is consistent with a model of Bayesian updating. Using an instrumental variables strategy, we find no evidence that increased gambling leads to increased debt.

Three Lessons from Accelerating Scientific Insight Discovery via Visual Querying.

(2020)

Exploratory data analysis is a crucial part of data-driven scientific discovery. Yet discovering insights from visualizations can be a manual and painstaking process. This article discusses some of the lessons we learned from working with scientists in designing a visual data exploration system, along with design considerations for future tools.

Uncovering Effective Explanations for Interactive Genomic Data Analysis.

(2020)

Better tools are needed to enable researchers to quickly identify and explore effective and interpretable feature-based explanations for discriminating multi-class genomic datasets, e.g., healthy versus diseased samples. We develop an interactive exploration tool, GENVISAGE, which rapidly discovers the most discriminative feature pairs that separate two classes of genomic objects and then displays the corresponding visualizations. Since quickly finding top feature pairs is computationally challenging, especially for large numbers of objects and features, we propose a suite of optimizations to make GENVISAGE responsive at scale and demonstrate that our optimizations lead to a 400× speedup over competitive baselines for multiple biological datasets. We apply our rapid and interpretable tool to identify literature-supported pairs of genes whose transcriptomic responses significantly discriminate several chemotherapy drug treatments. With its generalizable optimizations and framework, GENVISAGE opens up real-time feature-based explanation generation to data from massive sequencing efforts, as well as many other scientific domains.

Assessing the reliability of a clothing-based forensic identification.

(2020)

A 2009 report by the National Academy of Sciences was highly critical of many forensic practices. This report concluded that significant changes and advances were required to ensure reliability across the forensic sciences. We examine the reliability of one such forensic technique used for identification based on purported distinct patterns on the seams of denim pants. Although the technique was first proposed more than 20 years ago, no thorough analysis of its reliability or reproducibility has previously been reported. We performed a detailed analysis of this forensic technique to determine its reliability and efficacy.