Scholarly Works (108 results)

Thesis
Peer Reviewed

The Effect of Sampling Methods on Model Performance for Classification of Imbalanced Datasets

Weidner, Jeremy
Advisor(s): Wu, Yingnian

UCLA Electronic Theses and Dissertations (2022)

This paper applies various statistical techniques with the goal of maximizing model performance for the task of classification on a dataset with heavily imbalanced classes. A dataset is created by combining several sources into one comprehensive dataset. Exploratory data analysis will be performed to understand the available factors, their corresponding distributions, and their relationship to the outcome variable. Then steps will be taken to prepare the data for the task of classification. Next, a collection of different training set sampling strategies will be outlined using methods such as Random Over Sampling, Random Under Sampling, and Synthetic Minority Oversampling Technique. Machine learning models such as Random Forest Classifiers will be fitted for each of the sets of parameters, and the model fit will be evaluated on the test set in order to provide insight into the differences between the various sampling techniques in the imbalanced classification task. Metrics used to evaluate model fit will include traditional statistical measures as well as other strategies that more closely align with the specific business problem.
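As a rough illustration of two of the sampling strategies named in this abstract, the sketch below implements random over-sampling and random under-sampling in plain Python. It is not the thesis's code: the function names, the toy data, and the choice to balance every class to the largest (or smallest) class size are assumptions made for this example.

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_over_sample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_under_sample(rows, labels, seed=0):
    """Randomly drop rows until every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        out_rows.extend(rng.sample(members, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Made-up training data: 8 majority rows (class 0) and 2 minority rows (class 1).
train_rows = list(range(10))
train_labels = [0] * 8 + [1] * 2

_, over_labels = random_over_sample(train_rows, train_labels)
_, under_labels = random_under_sample(train_rows, train_labels)
print(Counter(over_labels), Counter(under_labels))
```

In line with the abstract, resampling of this kind would be applied to the training set only, so that the test set keeps the original class balance.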

Thesis
Peer Reviewed

Housing Sale Price Prediction Using Machine Learning Algorithms

Zhou, Yichen
Advisor(s): Wu, Yingnian

UCLA Electronic Theses and Dissertations (2020)

In this thesis, I explore how predictive modeling can be applied to housing sale price prediction by analyzing a housing dataset with machine learning models. I try four different models: linear regression, lasso regression, random forest, and XGBoost. Because the data have 79 explanatory variables with many missing values, I spend considerable time preparing the data, performing exploratory data analysis and feature engineering before model fitting. I then use RMSE and R-squared to measure model performance. The first model, linear regression, does not meet the assumption of equal variances, so it cannot be a candidate for the final model. Lasso regression gives RMSE and R-squared values that are not satisfactory. For the random forest, the R-squared on the training set is very good but the R-squared on the test set is relatively low, which suggests that the model is somewhat overfitting. Finally, I try XGBoost, and all of its results seem very good. Therefore, I use the XGBoost model as my final model to predict the housing price. The XGBoost model also shows which variables have important effects on sale price.
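The comparison of training and test R-squared described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's code: the helper name `r_squared` and the numbers are made up for the example.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up sale prices (in thousands) and model predictions.
train_actual = [200.0, 250.0, 300.0, 350.0]
train_pred = [202.0, 249.0, 301.0, 348.0]
test_actual = [220.0, 280.0, 330.0]
test_pred = [260.0, 270.0, 290.0]

gap = r_squared(train_actual, train_pred) - r_squared(test_actual, test_pred)
print(f"train/test R-squared gap: {gap:.3f}")
```

A large positive gap between the two values is the pattern the abstract reads as a sign of overfitting.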

1 supplemental file

Thesis
Peer Reviewed

An Empirical Study of Locally Updated Large-scale Information Network Embedding (LINE)

Xu, Yiwei
Advisor(s): Wu, Yingnian

UCLA Electronic Theses and Dissertations (2017)

Embedding very large information networks into low-dimensional vector spaces is useful in many tasks such as visualization, node classification, and link prediction. This paper studies the network embedding method called LINE, which optimizes a carefully designed objective function that preserves both the local and global network structures. To rule out the instability of the border vertices' embeddings and its influence on core vertices, we first peel off the border vertices and compute only the core graph with LINE. We then compute the embeddings of the peeled nodes with a local update process. This paper also tries to interpret and visualize the local update process with logistic regression, and optimizes the local update process by adding a prior and an intercept to the objective function. Finally, this paper demonstrates the embeddings on several multi-label network classification tasks for social networks such as BlogCatalog and YouTube. Our results show that the optimized LINE outperforms the initial method by 5% in F1-score on the YouTube dataset and speeds up convergence.

Thesis
Peer Reviewed

Click Prediction with Machine Learning Tools

Ding, Fan
Advisor(s): Wu, Yingnian

UCLA Electronic Theses and Dissertations (2018)

In this paper, we will explore multiple machine learning tools and their applications in the advertising technology industry. Companies like Sabio Mobile Inc. aim to provide platforms for advertisers to get their ads published and target their users with higher accuracy. With transaction log files offered by Sabio Mobile Inc., we will train several statistical models to predict whether or not a user will click on a certain ad.

Thesis
Peer Reviewed

Bitcoin Price Forecast Using LSTM and GRU Recurrent networks, and Hidden Markov Model

Xu, Yike
Advisor(s): Wu, Yingnian

UCLA Electronic Theses and Dissertations (2020)

Bitcoin, the first decentralized cryptocurrency, has become popular not only because a growing number of merchants accept it in transactions, but also because people buy it as an investment. This study focuses on Bitcoin price forecasting using the Hidden Markov Model and two machine learning methods, LSTM and GRU recurrent networks. Evaluated by MAPE and RMSE, the results indicate that the Hidden Markov Model with Gaussian Mixture Models has the best performance among all methods. The GRU model outperforms the LSTM model, though it sometimes produces more extreme results. The predictions are more precise when the price remains constant or changes steadily than during periods of fluctuation.

Thesis
Peer Reviewed

Beer Production and Beer Judge in United States

Hu, Yi
Advisor(s): Wu, Yingnian

UCLA Electronic Theses and Dissertations (2019)

The goal of this thesis is to analyze the current beer market in the U.S. using manufacturing data. It also gives insights into beer judging based on consumers' reviews. This thesis presents the market trends for producers and summarizes evaluations for beer lovers.

Thesis
Peer Reviewed

Stock Price Prediction using Adaptive Time Series Forecasting and Machine Learning Algorithms

Chen, Lumeng
Advisor(s): Wu, Yingnian

UCLA Electronic Theses and Dissertations (2020)

In this thesis, ARIMA, Long Short-Term Memory (LSTM), and Extreme Gradient Boosting (XGBoost) models were developed to predict the daily adjusted close price of selected stocks from January 3, 2017 to April 24, 2020. The daily stock price data include columns for open, close, adjusted close, high, low, and volume. In the ARIMA and LSTM models, the only features used as model inputs were the previous N days’ stock prices, and the prediction for day N+1 was calculated from those previous N values. RMSE and MAPE were calculated from this rolling forecast and the actual prices in the test dataset. The optimal parameters were those that yielded the lowest RMSE. Residual diagnostics were performed to check the model assumptions for the final ARIMA model. In the XGBoost model, feature engineering was used to create two additional features from the open, close, high, and low prices. As with the LSTM model, the previous N days’ features were used as inputs for the prediction on day N+1. In both the LSTM and XGBoost models, the training dataset was scaled for model fitting. Features and outputs from the cross-validation and test datasets were also scaled, based on the previous N days’ values. The predictions were then reverted to the original scale before RMSE and MAPE were calculated.

In conclusion, based on the plots of predicted versus actual stock price for each stock and their RMSE and MAPE scores, all three models produced good forecasts of the next day’s stock price. However, during periods of great volatility, the lag between the forecast and actual values is more noticeable. In our models, the previous N days’ stock prices on their own could provide a relatively accurate prediction of the price on day N+1. For the XGBoost model in particular, N=2 gave better RMSE and MAPE (%) results than larger values of N, and prediction accuracy decreased as N grew. In the XGBoost feature importance analysis, the most important factor for today’s stock price is yesterday’s price. Although the final ARIMA model achieved the lowest RMSE, the grid search over one-step ARIMA forecast parameters took the longest computing time, while the XGBoost model, with the second-lowest RMSE, required the least time for parameter tuning and forecast calculation.
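The rolling one-step forecast and the two error metrics mentioned in this abstract can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, the prices are made up, and the "last value" predictor is a naive stand-in for the ARIMA, LSTM, and XGBoost models used in the thesis.

```python
import math


def rolling_forecast(series, n, predict):
    """One-step-ahead forecasts; each prediction sees only the previous n values."""
    actual, forecast = [], []
    for t in range(n, len(series)):
        window = series[t - n:t]  # the previous n observations
        forecast.append(predict(window))
        actual.append(series[t])
    return actual, forecast


def rmse(actual, forecast):
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def last_value(window):
    """Naive stand-in model: predict that tomorrow's price equals today's."""
    return window[-1]


# Made-up adjusted close prices.
prices = [100.0, 102.0, 101.0, 105.0, 104.0]
actual, forecast = rolling_forecast(prices, 2, last_value)
print(f"RMSE={rmse(actual, forecast):.3f}  MAPE={mape(actual, forecast):.3f}%")
```

Any function that maps a window of N past values to a single number can be passed as `predict`, which is how different values of N or different models could be compared on the same rolling scheme.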

Thesis
Peer Reviewed

Prediction of Electronic Component Prices: from Classical Statistical and Machine Learning Models to Deep Neural Networks with Feature Embedding

Zhang, Yu
Advisor(s): Wu, Yingnian

UCLA Electronic Theses and Dissertations (2019)

The unit price of an electronic component with certain specifications and purchase details can be crucial for customers' decision-making. Using the massive historical purchasing data at Supplyframe, Inc., classical statistical and Machine Learning (ML) models are used to capture the underlying relationship and predict prices accurately. A naive model using mean estimation is adopted as the baseline, followed by the exploration of a wide range of machine learning models including Ordinary Least Squares, Support Vector Machine, k-Nearest Neighbors, Random Forests (RF), and Extreme Gradient Boosting (XGB). To make better use of the unstructured features, Deep Neural Networks (DNN) are built based on Convolutional Neural Networks and feature embedding, which maps unstructured features to higher-dimensional vectors. We observe that the RF and XGB models outperform the other classical statistical and ML models when only the structured features are used, while the DNN model proves to be the most powerful by combining both structured and unstructured features. The DNN model performs consistently better in terms of the root mean squared error, the prediction interval of the ratios of observed to predicted values, the prediction coverage rate, and the capture of the monotonically decreasing relationship between unit prices and purchase quantities.

Thesis
Peer Reviewed

An Application of Customized GPT-2 Text Generator for Modern Content Creators

Fang, Jingwu
Advisor(s): Wu, Yingnian

UCLA Electronic Theses and Dissertations (2021)

The number of content creators in the cyber world is growing faster year by year, and the competition is growing fiercer. Large video platforms such as YouTube offer creators incentives to upload original content more frequently. However, every creator has a different definition of novelty and uniqueness. The biggest challenge a creator has to face every day lies in the generation and practice of ideas.

As a result, a customized and efficient “idea” generator has become necessary in our times, and any content creator, whether video, advertising, or writing, can benefit from making their content unique efficiently without losing their style. The advent of GPT-2/3 makes this possible, and in this thesis, I will explore the types of models, the feasibility of streamlining, and the practical challenges of customizing a text generator for content creators nowadays.

Thesis
Peer Reviewed

Predicting the Returns of Progressive Corporation Stock

Xu, Amanda
Advisor(s): Wu, Yingnian

UCLA Electronic Theses and Dissertations (2023)

In this analysis, the objective is to forecast stock prices in the property and casualty insurance industry in 2022. This industry is known to be relatively stable and resilient to economic downturns. The training set consists of weekly adjusted closing prices of Progressive Insurance from 2019 to 2021. Three different models were created to predict weekly adjusted closing prices for 2022: LSTM and GRU recurrent neural network models, and an ARIMA time series model. Based on the results, the GRU method achieved the lowest RMSE, due to its ability to avoid overfitting and because it does not rely on the assumption of stationarity.
