Improving Big Data Box-Cox Transformation on Spark
Page: 1-61
2017
Thesis / Dissertation Description
This study investigates improving Spark computation with the Box-Cox Information Array when it is used to fit linear regression models. To find the linear regression model that best fits the data, traditional methods have to read the whole data set many times, which is time-consuming. Apache Spark can train linear regression models efficiently on distributed clusters because it processes the data in memory. However, if the data is very large or the computation produces a lot of temporary data, Spark has to spill data to disk and read it back later. These frequent I/O operations slow the Spark computation. With the method proposed by Zhang and Yang (2017), the information needed for linear regression can be held in memory in a small matrix called the Box-Cox Information Array. Building this information array requires only a single scan of the raw data. Once the array is built, the best linear regression model can be obtained directly from it. This study applies the Box-Cox Information Array method in Spark to understand how it affects Spark computation performance. The experiment shows that, when training 41 models, the Box-Cox Information Array method is about 8 times faster than the existing API provided in Apache Spark, and that it gives better prediction performance.
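
The single-scan idea can be illustrated with a minimal sketch. This is not the Box-Cox Information Array of Zhang and Yang (2017), whose exact layout is not given here; it is a simplified one-predictor version of the same general principle, written in plain Python rather than Spark. The function names (`box_cox`, `scan_once`, `best_model`) and the grid of 41 λ values from -2 to 2 are assumptions made for illustration. One pass over the data accumulates a small set of sums for every λ, and every candidate model is then fitted and scored (by the standard Box-Cox profile log-likelihood) from those sums alone.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y for parameter lam."""
    if abs(lam) < 1e-12:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def scan_once(rows, lambdas):
    """Single pass over (x, y) rows, accumulating sums for every lambda."""
    n = 0
    sx = sxx = slogy = 0.0
    # Per-lambda accumulators: [sum z, sum z*z, sum x*z], z = box_cox(y, lam)
    per_lam = [[0.0, 0.0, 0.0] for _ in lambdas]
    for x, y in rows:
        n += 1
        sx += x
        sxx += x * x
        slogy += math.log(y)
        for acc, lam in zip(per_lam, lambdas):
            z = box_cox(y, lam)
            acc[0] += z
            acc[1] += z * z
            acc[2] += x * z
    return n, sx, sxx, slogy, per_lam

def best_model(stats, lambdas):
    """Fit z = intercept + slope * x for each lambda from the sums only."""
    n, sx, sxx, slogy, per_lam = stats
    sxx_c = sxx - sx * sx / n  # centered sum of squares of x
    best = None
    for lam, (sz, szz, sxz) in zip(lambdas, per_lam):
        sxz_c = sxz - sx * sz / n
        szz_c = szz - sz * sz / n
        slope = sxz_c / sxx_c
        intercept = (sz - slope * sx) / n
        # Residual sum of squares; the floor avoids log(0) on a perfect fit.
        rss = max(szz_c - slope * sxz_c, 1e-300)
        # Box-Cox profile log-likelihood (up to an additive constant).
        loglik = -0.5 * n * math.log(rss / n) + (lam - 1.0) * slogy
        if best is None or loglik > best[0]:
            best = (loglik, lam, intercept, slope)
    return best

# Assumed grid: 41 values, -2.0 to 2.0 in steps of 0.1.
lambdas = [round(-2.0 + 0.1 * i, 1) for i in range(41)]
rows = [(float(x), 3.0 + 2.0 * x) for x in range(1, 21)]  # toy data
loglik, lam, intercept, slope = best_model(scan_once(rows, lambdas), lambdas)
```

Because every accumulator is a plain sum, partial results computed on separate partitions can be combined by addition, which is what makes this kind of summary a natural fit for a distributed engine. The raw-sum formulas used here are chosen for brevity and can lose precision on poorly scaled data.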