RowCore: A Processing-Near-Memory Architecture for Big Data Machine Learning

2016

Article Description

The technology-push of die stacking and application-pull of Big Data machine learning (BDML) have created a unique opportunity for processing-near-memory (PNM). This paper makes four contributions: (1) While previous PNM work explores general MapReduce workloads, we identify three workload characteristics: (a) irregular-and-compute-light (i.e., perform only a few operations per input word which include data-dependent branches and indirect memory accesses); (b) compact (i.e., the computation has a small intermediate live data and uses only a small amount of contiguous input data); and (c) memory-row-dense (i.e., process the input data without skipping over many bytes). We show that BDMLs have or can be transformed to have these characteristics which, except for irregularity, are necessary for bandwidth- and energy-efficient PNM, irrespective of the architecture. (2) Based on these characteristics, we propose RowCore, a row-oriented PNM architecture, which (pre)fetches and operates on entire memory rows to exploit BDMLs' row-density. Instead of this row-centric access and compute-schedule, traditional architectures opportunistically improve row locality while fetching and operating on cache blocks. (3) RowCore employs well-known MIMD execution to handle BDMLs' irregularity, and sequential prefetch of input data to hide memory latency. In RowCore, however, one corelet prefetches a row for all the corelets which may stray far from each other due to their MIMD execution. Consequently, a leading corelet may prematurely evict the prefetched data before a lagging corelet has consumed the data. RowCore employs novel cross-corelet flow-control to prevent such eviction. (4) RowCore further exploits its flow-controlled prefetch for frequency scaling based on novel coarse-grain compute-memory rate-matching which decreases (increases) the processor clock speed when the prefetch buffers are empty (full). Using simulations, we show that RowCore improves performance and energy by 135% and 20% over a GPGPU with prefetch, and by 35% and 34% over a multicore with prefetch, when all three architectures use the same resources (i.e., number of cores, and on-processor-die memory) and identical die stacking (i.e., GPGPUs/multicores/RowCore and DRAM).
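
The abstract does not give implementation details for contributions (3) and (4), so the following is only a toy Python sketch of the two ideas as described above: a prefetched row stays in the buffer until the slowest corelet has consumed it, and the clock steps down when the buffer is empty and up when it is full. The class `RowPrefetchBuffer`, the function `next_frequency`, and every parameter value (buffer capacity, frequency bounds, step size) are hypothetical names and numbers chosen for illustration, not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class RowPrefetchBuffer:
    """Toy model of a prefetch buffer of whole memory rows shared by corelets."""
    num_corelets: int
    capacity: int                       # number of rows the buffer can hold (assumed)
    next_row: int = 0                   # index of the next row to prefetch
    consumed: list = field(init=False)  # rows fully consumed, per corelet

    def __post_init__(self):
        self.consumed = [0] * self.num_corelets

    def occupancy(self) -> int:
        # A row stays live until the slowest (lagging) corelet has consumed it.
        return self.next_row - min(self.consumed)

    def try_prefetch(self) -> bool:
        # Flow control: never evict a row that a lagging corelet still needs.
        if self.occupancy() >= self.capacity:
            return False
        self.next_row += 1
        return True

    def try_consume(self, corelet: int) -> bool:
        # A corelet can only consume rows that have already been prefetched.
        if self.consumed[corelet] >= self.next_row:
            return False
        self.consumed[corelet] += 1
        return True


def next_frequency(freq_mhz, buf, step_mhz=100, lo_mhz=500, hi_mhz=1500):
    """Coarse-grain rate matching; all frequency values are illustrative."""
    if buf.occupancy() == 0:             # buffer empty: compute outpaces memory
        return max(lo_mhz, freq_mhz - step_mhz)
    if buf.occupancy() == buf.capacity:  # buffer full: memory outpaces compute
        return min(hi_mhz, freq_mhz + step_mhz)
    return freq_mhz
```

In this model, a leading corelet that races ahead simply stalls in `try_consume` once it has used every prefetched row, and `try_prefetch` keeps refusing new rows until the lagging corelet advances, which is the behavior the abstract attributes to cross-corelet flow control.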
