Improving Big Data Box-Cox Transformation on Spark

Pages: 1-61

Publication Year: 2017


Thesis / Dissertation Description

This study investigates whether the Box-Cox Information Array can improve Spark computation when it is used to fit linear regression models. To find the linear regression model that best fits the data, traditional methods have to read the whole dataset many times, which is time-consuming. Apache Spark can train linear regression models efficiently on distributed clusters because it processes the data in memory. However, if the data are very large or the computation produces a lot of temporary data, Spark has to spill data to disk and read it back later, and this frequent I/O slows the computation. With the method proposed by Zhang and Yang (2017), the information needed for linear regression can be held in memory in a small matrix called the Box-Cox Information Array. Building this array requires scanning the raw data only once, and the best linear regression model can then be obtained from it at once. This study applies the Box-Cox Information Array method in Spark to understand how it affects Spark's computation performance. The experiment shows that, when training 41 models, the Box-Cox Information Array method is about 8 times faster than the existing API provided in Apache Spark, and it achieves better prediction performance.
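
The single-scan idea described above can be illustrated with a small sketch. It is not the construction from Zhang and Yang (2017) or the thesis's Spark implementation; it only assumes that, for one predictor, the "information array" amounts to running sums from which every candidate model can be fitted without rereading the data. The function names (`scan_once`, `fit_all`) and the grid of 41 λ values from -2 to 2 are hypothetical choices for illustration, and model selection uses the standard Box-Cox profile log-likelihood.

```python
import math

# Hypothetical grid of candidate Box-Cox parameters: -2.0, -1.9, ..., 2.0 (41 values).
LAMBDAS = [(k - 20) / 10 for k in range(41)]

def box_cox(y, lam):
    """Box-Cox transform of a strictly positive value y."""
    if abs(lam) < 1e-12:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def scan_once(rows, lambdas=LAMBDAS):
    """Read the (x, y) rows a single time and accumulate sufficient statistics."""
    stats = {"lambdas": list(lambdas), "n": 0, "sx": 0.0, "sxx": 0.0, "slogy": 0.0,
             # one [sum z, sum x*z, sum z*z] triple per lambda, where z = box_cox(y, lam)
             "per_lambda": [[0.0, 0.0, 0.0] for _ in lambdas]}
    for x, y in rows:
        stats["n"] += 1
        stats["sx"] += x
        stats["sxx"] += x * x
        stats["slogy"] += math.log(y)
        for acc, lam in zip(stats["per_lambda"], stats["lambdas"]):
            z = box_cox(y, lam)
            acc[0] += z
            acc[1] += x * z
            acc[2] += z * z
    return stats

def fit_all(stats):
    """Fit z = intercept + slope * x for every lambda using only the stored sums."""
    n, sx, sxx, slogy = stats["n"], stats["sx"], stats["sxx"], stats["slogy"]
    sxx_c = sxx - sx * sx / n  # centered sum of squares of x
    results = []
    for (sz, sxz, szz), lam in zip(stats["per_lambda"], stats["lambdas"]):
        sxz_c = sxz - sx * sz / n
        szz_c = szz - sz * sz / n
        slope = sxz_c / sxx_c
        intercept = (sz - slope * sx) / n
        # Residual sum of squares; clamped so rounding error cannot make it <= 0.
        rss = max(szz_c - slope * sxz_c, 1e-300)
        # Box-Cox profile log-likelihood, up to an additive constant.
        loglik = -0.5 * n * math.log(rss / n) + (lam - 1.0) * slogy
        results.append({"lam": lam, "slope": slope, "intercept": intercept,
                        "rss": rss, "loglik": loglik})
    return results
```

Calling `max(fit_all(scan_once(rows)), key=lambda r: r["loglik"])` would then pick the candidate with the highest profile log-likelihood. In a distributed setting the per-partition sums could be added together, which is what makes a one-pass formulation attractive, though how the thesis does this in Spark is not stated here.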

Bibliographic Details

REPOSITORY URL: https://docs.lib.purdue.edu/dissertations/AAI10271288

URL ID: https://docs.lib.purdue.edu/dissertations/AAI10271288; https://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=19675&context=dissertations

AUTHOR(S)

Huayi Fang
