Hybrid address spaces: A methodology for implementing scalable high-level programming models on non-coherent many-core architectures


**Published in:**
Journal of Systems and Software

**Document Version:**
Early version, also known as pre-print

**Queen's University Belfast - Research Portal:**
Link to publication record in Queen's University Belfast Research Portal

**General rights**
Copyright for the publications made accessible via the Queen's University Belfast Research Portal is retained by the author(s) and / or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights.

**Take down policy**
The Research Portal is Queen's institutional repository that provides access to Queen's research output. Every effort has been made to ensure that content in the Research Portal does not infringe any person's rights, or applicable UK laws. If you discover content in the Research Portal that you believe breaches copyright or violates any law, please contact openaccess@qub.ac.uk.
Hybrid Address Spaces: A Methodology for Implementing Scalable High-Level Programming Models on Non-Coherent Many-core Architectures

Anastasios Papagiannis\textsuperscript{a,}\textsuperscript{*}, Dimitrios S. Nikolopoulos\textsuperscript{b}

\begin{itemize}
  \item \textsuperscript{a}Foundation for Research and Technology – Hellas, Institute of Computer Science (FORTH-ICS), Heraklion, Greece
  \item \textsuperscript{b}School of Electronics, Electrical Engineering and Computer Science, Queens University of Belfast
\end{itemize}

Abstract

This paper introduces hybrid address spaces as a fundamental design methodology for implementing scalable runtime systems on many-core architectures without hardware support for cache coherence. We use hybrid address spaces for an implementation of MapReduce, a programming model for large-scale data processing, and the implementation of a remote memory access (RMA) model. Both implementations are available on the Intel SCC and are portable to similar architectures. We present the design and implementation of HyMR, a MapReduce runtime system whereby different stages and the synchronization operations between them alternate between a distributed memory address space and a shared memory address space, to improve performance and scalability. We compare HyMR to a reference implementation and we find that HyMR improves performance by a factor of 1.71× over a set of representative MapReduce benchmarks. We also compare HyMR with Phoenix++, a state-of-art implementation for systems with hardware-managed cache coherence in terms of scalability and sustained to peak data processing bandwidth, where HyMR demonstrates improvements of a factor of 3.1× and 3.2× respectively. We further evaluate our hybrid remote memory access (HyRMA) programming model and assess its performance to be superior of that of message passing.

Keywords: MapReduce; Single-Chip-Cloud; Resource management; Runtime systems; Parallel Programming Models; Hybrid Address Spaces; Message Passing; Partitioned Global Address Spaces

1. Introduction

Many-core processors use diverging memory architectures. Processors designed for mainstream computing markets tend to use memory hierarchies with private multi-level caches per core and a hardware protocol to keep those caches coherent \cite{1}. This memory architecture resembles earlier shared-memory multi-processors from a programmer’s standpoint. However, processors designed for more specialized markets, such as high performance computing and large-scale data processing, use memory hierarchies without a coherence protocol. Graphics Processing Units (GPUs) \cite{2}, the Intel SCC \cite{3} the Cell processor \cite{4} and the experimental Runnemed prototype \cite{5} are representative examples of non cache-coherent architectures. Programming a non-coherent architecture requires explicit communication between local address spaces, through message passing or Direct Memory Access (DMA). Explicit communication increases the programmer’s burden, as it requires a high level of expertise in parallel programming and deep understanding of the memory hierarchy to master. However, explicit communication may also improve performance, particularly in applications with regular communication patterns. Programmers often opt for a programming...
model based on explicit communication even on cache-coherent many-core processors [6, 7], to exploit the topology of the interconnection network and minimize communication overhead. Runtime systems can ease the burden of programming with explicit communication to a certain extent by implementing high-level communication primitives and packaging them in user-level libraries (e.g., MPI). Alternatively, non-coherent architectures can be programmed with a high-level, shared address space model. In this case, the runtime system implements a virtual shared memory abstraction. Regardless of the choice of programming model, the runtime system is a critical component that largely defines performance, scalability, and programmability.

Runtime systems for non-cache-coherent architectures are currently implemented on top of distributed address spaces, typically using one address space per core. The runtime system itself implements all necessary inter-core communication operations for scheduling and synchronization, as well as all application-level communication through explicit message passing or DMAs. These operations flow either exclusively between local memories or between local memories and DRAM. This implementation paradigm has been used on the Cell processor, for implementing shared-memory programming models such as OpenMP [8], COMIC [9], Sequoia [10], and CellSs [11] and the Intel SCC for the implementation of X10 [12] and Shared Virtual Memory models [13]. Intuitively, explicit communication in the runtime system yields a scalable implementation. In particular, explicit communication leverages on-chip data transfer paths and a scalable NoC interconnect for passing data between cores without paying the cost of off-chip memory accesses. This approach works particularly well for exchanges of messages that fit in on-chip local memories. However, this approach is not necessarily optimal in other cases. Applications often need to transfer large amounts of data between threads in a program with little or no processing on the data itself. If these streaming data transfers flow through the on-chip memory hierarchy, they will incur cache pollution, without offering an opportunity for data reuse. Such operations should be best left uncached to maximize performance. A shared, global address space model suits these operations best.

This paper introduces hybrid address spaces, as a fundamental design and implementation methodology for scalable runtime systems on non-coherent many-core architectures. The intuition behind hybrid address spaces is that a runtime system uses on-chip communication paths between private address spaces for small data transfers, such as those needed to exchange control data for scheduling, and off-chip communication paths through a shared address space, for large, streaming data transfers. To confirm our intuition, we present HyMR, an implementation of the MapReduce programming model [14] on the Intel Single-Chip Cloud Computer [3]. The MapReduce runtime implements a staged execution model. We show that while certain stages are best implemented with message passing over a distributed address space, other stages are best implemented with in-place memory copying in a single, global address space, or with a combination of distributed and shared address spaces. In demonstrating the concept of hybrid address spaces in runtime systems, we make several more contributions towards improving performance and scalability of MapReduce on non-cache-coherent many-core architectures. These contributions include:

- software-controlled staged memory coherence to minimize the overhead of coherence maintenance;
- application-specific, scalable data splitters;
- scalable, interrupt-less work-stealing for non-coherent architectures using exclusively on-chip communication;
- a new implementation of scalable on-chip barrier algorithms for non-coherent many-core processors;
- a new mechanism to enable fast access from a core to the private memory of another core on-chip, which accelerates global exchange operations;
- a parallel sorting algorithm that avoids synchronization between stages and executes critical communication paths using on-chip shared memory.

Our implementation of HyMR provides design guidelines for latency and throughput critical runtime system operations that are common to many, if not all, programming models. These include scheduling and load balancing, data distribution, point-to-point and group communication operations.
We compare HyMR to a reference runtime system implemented using exclusively message passing. HyMR outperforms the baseline in all tests. We also compare HyMR with Phoenix++, a state-of-art MapReduce implementation for hardware-managed cache-coherence systems [15]. HyMR achieves, on average, 3.1× improvement in speedup and 3.2× improvement of bandwidth efficiency compared to Phoenix++, on the same number of cores.

To further demonstrate hybrid address spaces as a viable methodology for implementing parallel programming models, we have also developed a hybrid remote memory access (HyRMA) programming model, which leverages message passing for on-chip, latency-sensitive data one-way transfers and global shared memory for one-way bulk data transfers. We demonstrate HyRMA with a representative stencil code, the Jacobi method. HyRMA improves performance by a factor of 2.41× using 48 cores, compared to a pure message passing approach.

The rest of this paper is organized as follows: Section 2 provides background on MapReduce and the Intel SCC processor. Section 3 presents the design and implementation of DiMR, a reference implementation of the MapReduce runtime for SCC processor, which uses exclusively distributed address spaces. Section 4 presents the design and implementation of HyMR. Section 5 presents our experimental analysis and results. Section 6 presents the implementation and experimental analysis of HyRMA. Section 7 discusses related work and Section 8 concludes the paper.

2. Background

Hardware support for cache coherence on a processor with many cores increases complexity and power [16]. Although many efforts attempt to address the scaling and power limitations of cache coherence on systems with many cores [1], several vendors of many-core processors opt for a non-cache-coherent architecture. On such an architecture, a programmer writes parallel code using either explicit communication mechanisms or a shared virtual memory layer implemented in software. In this section we provide background on non cache-coherent many-core processors and programming models providing a shared memory abstraction on such processors. We discuss in more detail the architecture of the Intel Single Chip Cloud Computer (SCC), a processor prototyped to explore the performance, programmability and power-efficiency of non-coherent architectures. We use the SCC as an implementation vehicle for implementing scalable runtime systems with hybrid address spaces. The use of SCC is by no means limiting our study: our runtime system design and implementation techniques presented later in this paper generalize to any non-coherent many-core processor with programmable memory mapping/translation tables and a mechanism for explicit on-chip communication between cores. We conclude this section by providing background on MapReduce, a parallel programming model for large-scale data processing, inspired by functional languages.

2.1. Intel Single-Chip-Cloud-Computer (SCC)

The Intel SCC\footnote{The SCC is not a stand-alone computer thus to get it running, a management PC (MCPC) needs to be used. The SCC connects to the MCPC through external PCIe.} [17] (Figure 1) is a many-core processor with 24 tiles and two IA cores per tile. The tiles are organized in a 4×6 mesh network with 256 GB/s bisection bandwidth. The processor has four integrated DDR3 memory controllers, one for each group of six tiles. Each core has a private L1 instruction cache of 16 KB, a private L1 data cache of 16 KB and a private unified L2 cache of 256 KB. Each dual-core tile has a 16 KB message passing buffer (MPB). The MPB is the only component of the SCC on-chip memory hierarchy that is shared between cores. The SCC does not implement cache coherence between MPB and caches. The MPB provides space for direct core-to-core communication. Data used in on-chip communication is read from the MPB, bypassing the L2 cache. For writes, a no-allocate policy is used, in conjunction with a write combining buffer in the L1 cache. Software needs to maintain coherence between the MPB and the L1 caches by using an L1 cache invalidation instruction (CL1INVMB), when data is stored in the MPB. According to the processor specifications [18], the latency to read a cache line from MPB buffers and off-chip DRAM are:
Local \( MPB = 45C_c + 8C_m \) \hspace{1cm} (1) 
Remote \( MPB = 45C_c + 4 \cdot n \cdot 2C_m \) \hspace{1cm} (2) 
\( DRAM = 40C_c + 4 \cdot n \cdot 2C_m + 46C_r \) \hspace{1cm} (3)

where \( C_c, C_m \) and \( C_r \) denote the clock cycles of the core, the mesh network and the DRAM respectively and \( n \) denotes the number of mesh network hops required to reach the destination (\( 0 < n \leq 8 \)). Although the difference to access MPB and DRAM is 46 DRAM cycles, accesses to the MPB bypass the L2 cache, which cannot be flushed or invalidated from hardware. The obvious drawback of using the MPB is its small size (8KB per core).

2.1.1. SCC Address Spaces

The SCC uses 32-bit Pentium cores. A programmable, software-managed translation table (called Look-Up Table or LUT) enables the system to extend the width of physical addresses to 34 bits, allowing system configurations with to up to 64 GB of off-chip memory (specifically, up to 16 GB for each of four groups of six tiles). The LUT has 256 entries, each mapping 16MB of DRAM. Software control of LUT mappings provides a means for implementing hybrid private and shared address spaces in the system.

Figure 2 shows the default configuration of LUT entries. The SCC reserves 41 (0–40) entries at the bottom of the LUT to map up to 656 MB of private physical memory for each core. The operating system running on the core uses part of this memory, while the user can use the rest. Intel provides a custom Linux kernel that during the boot process, allocates 5 (34–38) contiguous entries from each core’s private address space, called \( POPS\!H\!M \). Four entries (128–131) in the LUT are shared among all cores. Some parts of this shared memory are used by system services\(^2\). Entries 192–215 in the LUT map MPBs and entries 224–247

\(^2\)For example, MCPC and the on-die network driver that allows TCP traffic from core to core.
map configuration registers of cores. Entry 250 addresses the system interface; access to this memory is confined to the PCIe driver. Entry 251 addresses the voltage regulator control (VRC) registers. There is no restriction in reprogramming LUT entries to translate to a different address space during the execution of a program.

2.1.2. SCC System Software

From the programmer’s point of view, SCC resembles a cluster with portions of memory shared between cores. Each core runs its own image of the Linux kernel. Cores communicate through messages and several libraries that provide message passing primitives are available to programmers, including Intel’s RCCE [3] and RCKMPI [19]. Small messages can be exchanged directly on-chip using the MPBs. Large messages on the other hand can be exchanged via a memory copy in DRAM. Figure 3 shows the flow of messages in both cases, using an example where core 0 sends a message to core 47. When sending a small message of size less than 8KB, the sender writes the message in its local MPB. The L2 cache is bypassed and the L1 cache is configured as write no-allocate. The sender stores flags in the MPB to synchronize this operation with the receiver. When the data is ready, the receiver can read the data to its private memory through its own L1 cache. The MPB provides higher bandwidth and lower latency than the available shared memory. In spite of this advantage, message passing for messages larger than 8KB can be faster through DRAM, due to protocol overheads related to the small size of the MPB and the necessity to split and reassemble large messages into chunks of size up to 8KB. The alternative is to use shared DRAM to exchange messages greater than 8KB. The L2 cache can still be bypassed in this case, to avoid severe cache pollution. When transmitting a large message, the sender writes the whole message in shared memory. The L1 caches need to be flushed to maintain coherence and consistency. The receiver can read the whole message from shared memory through the L1 cache.

Intel’s RCCE library implements message passing using exclusively MPB buffers. On the other hand
RCKMPI uses MPBs for small messages and DRAM for large messages. The SCC provides a facility to invalidate all MPB data with a single instruction (CL1INVMB), flush all L1 cache data with a single instruction (INVFLUSH), or invalidate all L1 cache data with a single instruction (INV). Due to the lack of a hardware flush/invalidate mechanism, the processor can use a software memory driver to flush the L2 cache, if needed. Selective use of the L1 and L2 caches is critical for performance and we revisit this issue while discussing the implementation of HyMR on the SCC.

2.2. The MapReduce Programming Model

MapReduce is a set of language abstractions, inspired by Lisp [14], to express data-parallel computations and aggregations. The MapReduce programming model is widely popular among developers of algorithms for “Big Data” analytics. MapReduce is commonly employed for running crawling and machine learning algorithms on large volumes of text and image data, as well as processing large graphs [14, 20, 21, 22]. Practical implementations provide MapReduce abstractions as a library API or embed MapReduce in a high-level language, such as Java [23, 24, 15, 25, 26, 27, 28, 29, 30, 31].

A MapReduce application applies a parallel operator, the map function, on input data structured as a sequence of <key,value> pairs. The output of the map function is a set of intermediate <key,value> pairs. A user-defined reduction operator, the reduce function, aggregates the intermediate pairs according to their keys. Finally, the aggregated pairs are sorted by key. Aggregation and sorting are optional in MapReduce applications. The language or library may provide standard aggregators and sorting functions for high performance and ease of programming.

Listing 1 shows a textbook MapReduce example that counts the number of occurrences of each word in a collection of documents [14]. The map function emits each word from the documents with a temporary count
of occurrences set to 1. The reduce function measures the total number of occurrences for each unique word. The MapReduce program applies operators on data lying in a single logical address space, albeit the actual implementation may distribute data between physically separate memories and disks. The operators adhere to a share-nothing model, which virtually eliminates races, deadlocks, and most complexities that render correctness checking hard on conventional parallel programming models. On the flip side, the performance of MapReduce programs is heavily dependent on the implementation efficiency and scalability of the runtime system.

To MapReduce runtime system (Figure 4) splits input pairs into work units. Tasks executing the map function (mappers) process work units in parallel across multiple nodes, processors, or cores. The runtime system partitions the intermediate pairs produced from mappers into buckets with each bucket holding pairs with the same key. These buckets, called partitions in MapReduce parlance, are distributed between tasks executing the reduce function (reducers). The runtime system finally merges and sorts the output pairs produced by reducers.

A MapReduce runtime system must optimize execution-time parameters such as the size of work units, the number of mappers and reducers, the assignment of work units to nodes, processors or cores and the allocation and management of buffer space between stages of the computation. The runtime system can
perform several additional optimizations: eliminate global synchronization between stages of MapReduce, using a dataflow execution model [32]; eliminate function call overheads by increasing the granularity of work units [23, 24]; reduce load imbalance also by adjusting the granularity of work units and/or the number of mappers and reducers [33]; optimize locality and overlapping computation with data transfers by prefetching work units [34]; and conserve bandwidth and cache space via hardware compression [35]. The runtime system can also provide scalable, application-specific fault tolerance, which is beyond the scope of this work.

3. DiMR Design and Implementation

To place HyMR in context, we first discuss a reference implementation of the MapReduce runtime system using exclusively message passing over distributed address spaces. This design views the SCC as a cluster of single-core nodes, each with its own Linux image. Cores exchange messages using the RCCE library [3]. RCCE implements all communication between cores through MPBs. We choose RCCE over RCKMPI due to superior performance. Figure 5 shows that native RCCE achieves better throughput than RCKMPI, when communication flows through the SCCMPB channel, which uses exclusively the on-chip message-passing buffers. A detailed description of the reference design is available in [36].

The reference design implements a seven-stage runtime system for MapReduce. We refer to the seven stages as map, combine, partition, group, reduce, sort and merge. The combine and merge stages are optional in typical MapReduce setups, whereas the group stage replaces an intermediate sorting stage of MapReduce to reduce computational complexity [27, 29, 23, 26]. Figure 6 shows the stages and what messages are exchanged between cores in each of them. We use the WordCount benchmark as an example to explain the details of these stages.

In the map stage, the runtime system divides the input evenly to as many partitions as the number of cores. Each core then executes the user-defined map function over the data in its private partition. During this stage the runtime does not exchange any messages between cores. This function takes a <key,value> pair as input and produces one or more intermediate <key,value> pairs. The volume of the intermediate output is unknown until runtime. To reduce memory management overhead, the reference design preallocates a large chunk of memory (64 MB in our implementation) to hold intermediate data and allocates more space on demand, if the intermediate data overflows the preallocated chunk. To split intermediate data between different partitions, the reference implementation provides an option between a user-defined hash function and a generic hash function, the latter implemented in the MapReduce runtime system. The hash function takes a key as an argument and returns the ID of a partition to store the generated intermediate <key,value> pair. Each core emits keys and values in a contiguous buffer.

The combine stage executes if and only if the user provides a combiner function. This stage is executed locally, as does map, and does not exchange messages between cores. The purpose of this stage is to reduce

---

3Not to be confused with the partition stage of the MapReduce runtime system.
locally the size of each partition produced by a given core during **Map**. The combiner function takes a key and a list of partially aggregated intermediate values associated with the same key, as input. It produces a single `<key,value>` pair where the value is an updated partial aggregation of the values associated with the key, as output. Following the **Combine** stage, the runtime system synchronizes the cores using a barrier.

The **partition** stage performs an all-to-all exchange between cores. Data partitions generated during **Map** may differ in size. DiMR uses a custom all-to-all exchange algorithm for the SCC to achieve scalable data partitioning. The algorithm first executes an all-to-all exchange of the intermediate partition’s sizes, followed by an all-to-all exchange of the intermediate data [36]. The algorithm implements the all-to-all exchange using pairwise exchanges. Let $p$ be the number of available cores and $rank$ the core ID. This algorithm uses $p - 1$ steps and in each step $k = 1 \ldots p - 1$ the core ranked $i$ receives data from core $i - k$ and sends data to core $i + k$. We use the $RCCE_{\{send, recv\}}$ functions to implement this all-to-all exchange.

The **group** stage groups together all `<key,value>` pairs with the same key, taken across all intermediate data partitions. All the data needed by each core in the **group** stage lies in the core’s private memory and there is no need to exchange any messages between cores. Prior research [27, 29, 23, 26], uses generic sorting with a user-defined comparator to perform grouping in MapReduce. Our reference implementation uses a variant of radix sort [37] for grouping on the SCC. The quicksort algorithm employed in prior MapReduce implementations on multi-core systems has complexity $O(n\log n)$, whereas radix sort has complexity $O(kn)$ where $k$ is the size of the key in bytes. Figure 7 shows a comparison of the libc quicksort implementation and our radix sort implementation for different input sizes. The measurements are from one core on the SCC. Radix sort outperforms quicksort, with one caveat. Radix sort sorts strings of bytes and can not use a user-defined comparator for sorting. This implies that in applications where the key data type is not a string, radix sort may produce unsorted sequences that need to be processed further in the following stages of
MapReduce. In the common case, the data produced before the reduce stage is more than the data produced after the execution of the reduce stage. This happens because key duplication in the data generated before the reduce stage. Following the reduce, there are only distinct keys and a single value associated with each key. We choose to run the actual sorting algorithm after the reduce stage.

The reduce stage executes a user-defined key aggregation function. The prior group stage exports an array of distinct keys, each containing the number of occurrences of the key and a pointer to an array of its values. The output size of the reduction stage can be statically identified, therefore the implementation preallocates the stage’s output buffers. In the sort stage, the implementation sorts the <key,value> pairs produced following the reduction, using sequential quicksort and a user-specified comparison operator. Both reduce and sort stages execute locally on private memory and do not necessitate the exchange of messages between cores. An optional merge stage merges the output of all cores in one core. The reference implementation uses the binomial merge algorithm for this stage [38], which completes in \( \log n \) steps. In each of these steps cores exchange the previously merged output data.

4. HyMR Design and Implementation

In a hybrid address space design, a runtime system uses on-chip communication paths for small data transfers, such as the data transfers needed to pass pointers for the purposes of scheduling, and off-chip communication paths through shared memory for performing transfers of large messages with application data. HyMR implements a staged execution model. We elaborate why while certain stages are best implemented over a distributed address space, other stages are best implemented over a shared address space.

4.1. HyMR Stages

Figure 8 shows the stages of HyMR. HyMR has four stages, compared to DiMR’s seven. HyMR merges the Map and Combine stages into a single stage and eliminates the Group stage. A new implementation of the Map stage allows the grouping of intermediate data before the Reduce stage. HyMR further merges the Sort and Merge stages into a single Sort stage. As the Sort stage is implemented using a shared memory address space, there is no need to merge the sorted partitions. HyMR uses Partition and Reduce stages which are identical to the respective stages in the reference design.

HyMR implements application-specific memory coherence using the MapReduce execution stages as natural coherence boundaries and MapReduce stage semantics as hooks for coherence actions in the runtime system. The runtime system guarantees coherence at the completion of stages. HyMR flushes the L2 cache following the execution of mappers and combiners, as the privately owned POPSHM address space is cacheable and the SCC has no native hardware support for cache coherence. The flush completes with
a memory barrier. *Partition* and *Reduce* execute no coherence actions, as both these stages execute in
distributed, private address spaces. To guarantee that all cores complete with *Reduce* stage we execute a
barrier before the *Sort* stage. The *Sort* stage uses a parallel sorting algorithm with regular sampling (PSRS).
PSRS executes in four sub-stages, (quicksort, local regular sample sorting, exchange and merge), separated
by barriers. The runtime system flushes the caches of each core after the completion of quicksort and merge
substages. We provide more details in Section 4.1.5.

### 4.1.1. Scalable Application-Specific Data Splitters

HyMR uses scalable input splitters over a shared address space. The input is stored in shared off-chip
memory and is accessible from all cores. The input is read-only so there is no need for synchronization in
accessing the input during splitting. Each core retrieves a private partition of the input without communicat-
ing with other cores, using a local, sequential prefix-scan algorithm. Therefore, splitting can be implemented
entirely in parallel. Following splitting, each core allocates a queue in its private MPB buffer for the input
<key,value> pairs. The runtime executes a user-specified *map* function on each item in the queue. The split
function distributes the input evenly between cores, although application-specific splitters can be used in the
same context for better load balancing. HyMR provides three application-specific *splitters*, a *text splitter*, a
line splitter and a generic splitter. Users may also implement a custom splitter to divide the input size in a different way than the three provided splitters. The generic splitter uses a prefix-scan algorithm running independently on each core, to identify the beginning of each core’s chunk in the input without inter-core communication. The text and line splitters divide characters or text lines by default as evenly as possible between cores.

4.1.2. Map

Map tasks have no side effects and no dependencies between them [14]. Therefore, they are suitable for running in a distributed address space. No coherence actions are needed during the execution of the Map stage. Committing combined intermediate data to shared memory necessitates a flush of the L2 cache at the end of the Map stage, which includes a combiner. The runtime system stores the output of each mapper task running on a core in the core’s POPS HM address space.

Each core executes mappers that process a queue of inputs provided from splitters. Mappers emit intermediate <key, value> pairs, using a user-specified hash function to distribute their intermediate outputs between as many partitions as the number of cores. These partitions are aggregated in following MapReduce stages. Each core uses a private, cacheable POPS HM address space for mapping data, as no coherence actions are necessary during this stage. This space is represented by five LUT entries, or 80MB. The output of mappers is held in containers, implemented as an array of lists of values, with one list per key. HyMR uses a hash table with open addressing, which is faster than separate chaining, Red-Black trees and AVL trees, which we also evaluated on the SCC. The hash table contains 4096 buckets. The runtime system implements dynamic resizing of the hash table if a core exports more than 4096 intermediate <key, value> pairs, by doubling the size of the table when the fraction of used buckets in the table exceeds a predefined threshold (currently set to 0.8). HyMR’s hashing uses quadratic probing to resolve collisions. Cores can not export more than five LUT entries (80MB) of intermediate data. The POPS HM implementation in the Linux kernel sets this as a hard limit. The runtime system uses a custom, fast memory allocator with pointer bumping and performs no deallocation in POPS HM address spaces.

HyMR combines the output of mappers, by reducing the data with a user-defined aggregator. The distributed memory implementation uses an all-to-all exchange at this stage. Implementing a combiner in DiMR would necessitate data marshaling (serialization and deserialization), which would add substantial communication overhead. HyMR on the other hand optimizes the combiner by performing an in-place aggregation of intermediate data in private memory, as the data is produced by mappers. This minimizes space and time overhead by avoiding redundant memory allocation and storing only aggregated data.

4.1.3. Partition

HyMR uses cacheable shared memory to implement an all-to-all exchange of the voluminous, in the common case, data emitted from mappers. The runtime system merges all intermediate containers of each core in a single container stored in private memory. This container contains <key, list-of-values> pairs. HyMR stores distinct keys and for each key assigns a list of all values produced by all cores during Map stage. The runtime system then goes through an iterative process where in each iteration, it modifies the LUTs of a core to map to the POPS HM private address space of another core. The runtime system knows at execution time the starting physical address of each POPS HM segment. We use an Intel driver to map the physical addresses of each POPS HM segment to the virtual address space of user programs. Coherence actions are avoided, by marking the pages in POPS HM address space as non-cacheable in the L2 cache. Given that all POPS HM pages are read-only in this stage and there is no physical data copying involved, there is no need to flush the L1 caches. An invalidation of the caches before each LUT remapping suffices for coherence. The runtime system avoids using the L2 cache in this stage because of the lack of an efficient, hardware supported invalidation mechanism. Therefore, the runtime system only has to invalidate the L1 cache after each remapping. The remapping process requires as many iterations as the number of cores. To avoid contention when two or more cores access DRAM through the same memory controller, each core begins remapping from its local core’s POPS HM and increases the POPS HM index round-robin. Figure 9 shows this algorithm using four cores as an example. This process guarantees that memory traffic and
contention are balanced between the memory controllers. Remapping POPSHM address spaces requires no synchronization.

4.1.4. Reduce

HyMR uses both the cacheable private and the cacheable shared address spaces to implement the Reduce stage. The input data of this stage is stored in the private memory of each core. The runtime system stores the reduced data in shared memory. Before executing the reduction, each core has in its private memory a hash table of all <key, list-of-values> pairs on which it must execute the user-defined reduction. The runtime system iterates through each <key, list-of-values> pair and calls the user specified reduce function on it. HyMR provides an iterator interface for the list-of-values that the user can use in the reduction. The result of each call to Reduce call is an output <key,value> pair. HyMR uses shared memory to store these pairs in order to all cores can access these in the next stage.

4.1.5. Sort

DiMR uses a binomial merge algorithm based on message passing. In HyMR, the output is stored in cacheable shared memory instead and all cores execute parallel sorting using regular Sampling (PSRS) [39]. The authors in [39] claim that if the input has no duplicate keys this algorithm has good load-balancing
properties compared to the other parallel sorting algorithms. In MapReduce, the input of this stage has no
duplicate keys.

In PSRS, each core exports in shared memory an array of output <key,value> pairs. In this step, the
runtime has to merge as many arrays as the number of cores into a single array, which is also sorted. Parallel
sorting algorithms choose \( c - 1 \) pivots and split the input into \( c \) partitions, \( c \) the number of cores. The cores
exchange data to retrieve their respective partitions and sort each partition locally. The selection of pivots
is critical for load balancing. A proof of the load balancing properties of this algorithm is provided in [39].

PSRS has four stages. Assume that the runtime system must sort \( n \) keys on \( c \) cores. In the first stage,
each core uses quicksort to sort its share of the elements, which amounts to \( \lceil n/c \rceil \) elements. Each core
selects the data items with indices \( 0, n/c^2, 2n/c^2, \ldots, (c - 1)(n/c^2) \) as a regular sample of its locally sorted
block. In the second stage of the algorithm, one core gathers and sorts the local regular samples. It selects
\( c - 1 \) pivot values from the sorted list of regular samples. The pivot values are at indices \( c + [c/2] - 1, 2c +
[c/2] - 1, \ldots, (c - 1)c + [c/2] \) in the sorted list of regular samples. At this point each core partitions its
sorted sublist into \( c \) partitions, using the pivot values as separators between partitions. In the third stage
of the algorithm, cores exchange partitions. During the fourth stage, each core partitions its
\( c - 1 \) partitions with its private partition into a single list. The values on this list are disjoint from the values on the lists of
other cores. At the end of this stage the elements are sorted in a single array.

HyMR implements a hybrid address space version of PSRS using on-chip MPB buffers for communication,
instead of shared memory, to minimize latency and achieve simple coherence maintenance. Communication
includes the addresses and sizes of intermediate buffers needed by the third stage of this algorithm. The authors in [39]
propose that only one core (without loss of generality, core 0) can choose the samples and sort them to find the actual pivots. This method requires however 2 barriers. Since input data is read-only
and PSRS is not in-place, we can lift the restriction that only one core chooses the pivots. All cores choose
the pivots with the same PSRS algorithm, without synchronization. As all data reside in off-chip shared
memory and all cores can access the data through LUTs, there is no need to execute an all-to-all exchange.
The runtime system allocates space for the output array in shared memory and stores the sorted partitions
in this array.

Figure 10 shows the speedup of the hybrid address space implementation of PSRS over the sequential
libc qsort implementation. We use the same qsort implementation in the first phase of PSRS.

4.2. HyMR MapReduce Optimizations

HyMR uses several additional optimizations that leverage hybrid address spaces.
4.2.1. Optimizing On-Chip Barriers

We revisited the scalable barrier algorithms presented in [40], to explore how these algorithms perform and should be revised in the presence of private, on-chip address spaces with fast communication paths that do not involve off-chip memory. We implemented the algorithms with on-chip data transfers, keeping all shared metadata of each algorithm (e.g. counters) in the MPB buffers and using the cacheable private address space of each core otherwise. We leverage the on-chip shared memory because the shared data needed to implement synchronization algorithms has a very small memory footprint. Furthermore, the runtime system can bypass the L2 cache and use the `CL1INVMB` instruction to invalidate data before reads and the write no-allocate policy with a write combining buffer for writes.

We experimented with the Centralized Barrier, Tournament Barrier, Tree Barrier and Dissemination Barrier from [40]. We compare these algorithms against the barrier implementation provided with RCCE named `RCCE_barrier`. This is a simple, similar to a centralized, counter-based barrier with local sensing but instead of a single counter, each core has its own local counter stored in MPB buffers. This implementation reduces the contention in MPB memory compared to the Centralized Barrier in [40]. Figure 11 compares the barrier implementations. In the Centralized Barrier all shared data is stored in a single MPB. The latency that each core expends to access that MPB depends on the number of hops in the SCC 2D mesh interconnect. The Centralized Barrier algorithm is ill-suited for many-core processors with distributed on-chip memory. The `RCCE_barrier` has the disadvantage that a single root core must update a flag on each other core that participates in the barrier. All other algorithms distribute shared data between MPB buffers in a way that minimizes accesses to remote MPB buffers. Figure 11 indicates that the Dissemination Barrier algorithm is the best fit to the SCC, a result which confirms the result in [40] and generalizes it to chip multi-core processors with non cache-coherent memory.

4.2.2. Interrupt-less Work-Stealing

On the SCC, the latency for accessing DRAM depends on the number of hops that the access must traverse in the chip’s 2D mesh until it reaches a specific memory controller that serves all accesses from the issuing core. In memory-intensive applications this architectural feature can introduce load imbalance. We implement a work stealing algorithm inspired by Cilk [41], using however the MPB to implement fast, on-chip communication between the local core schedulers. Scheduling and work stealing are thus implemented using explicit communication between cores. We implement scheduling deques as non-cacheable queues and preserve coherence for the state of deques using explicit invalidation of entire MPB buffers. We use work-stealing only in the Map stage. Other stages are balanced with the choice of an appropriate hash function during the Map stage. Although we implement the Map stage using distributed address spaces, we choose to implement work-stealing using on-chip shared-memory (MPB buffers). Using the MPB on-
chip shared-memory, a thief can get a portion of work from the victim without interrupting the victim’s execution. Thieves choose victims randomly, as in Cilk.

5. Experimental Analysis

We compare HyMR to DiMR to validate the advantages of using hybrid address spaces over distributed address spaces and explicit communication, in the implementation of scalable runtime systems. We further compare HyMR to Phoenix++, a state-of-art implementation of MapReduce for multi-core systems with hardware-supported cache coherence [15]. We use four benchmarks which are representative of MapReduce applications:

- **WordCount** counts the number of occurrences of each word in text files. The map function splits the input text into words, whereas the reduce function sums the number of occurrences of each word to produce a final count. The number of distinct intermediate keys is the number of distinct words in the text files.
- **Histogram** counts the frequency of occurrences of each RGB color component in an image file. The map function emits the occurrences of each color component in pixels and the reduce function produces the sum of occurrences of each component. The maximum number of distinct intermediate keys is $3 \times 256$.
- **LinearRegression** computes a line of best fit for a set of points, given their 2D coordinates. Map computes intermediate summary statistics for the points like the sum of squares, while reduce gathers all data of each of the summary statistics and calculates the best fit. This benchmark exports 5 intermediate keys.
- **MatrixMultiply** multiplies two dense matrices of integers. In this benchmark the *Map* function implements the matrix multiplication kernel and does not emit any intermediate data. The runtime splits the input and each chunk is a row of each input matrix. The runtime also uses work-stealing to balance the load between the available cores.

We use the same MapReduce algorithms for these benchmarks as Phoenix++ does. This makes the algorithmic comparison of HyMR and DiMR more fair than if we chose to customize the algorithms to our implementation. We choose benchmarks that vary in the number of distinct intermediate keys that they produce, to stress different stages of the MapReduce runtimes. **WordCount** represents one extreme, by exporting as many number of intermediate keys as the number of words in the input text files. **MatrixMultiply** represents the other extreme, since it does not produce any intermediate keys. **Histogram** and **LinearRegression** are between these limits. **Histogram** exports from 0 to 768 distinct intermediate keys depending on the input. **LinearRegression** exports 5 distinct intermediate keys for every input. Benchmarks that emit a large number of intermediate keys stress the Combine, Rearrange and Merge stages. On the other hand, benchmarks that produce no intermediate keys stress the Map stage.

Table 1 lists the MapReduce application workloads that we used for experiments. In order to run these benchmarks in-memory on our SCC board, we maximize the size of the input data sets so that the sum of input, intermediate and output data fits in shared DRAM. We use an SCC node, where each tile of cores runs at a frequency of 800MHz, the mesh interconnect runs at a frequency of 800MHz and DRAM runs at a

<table>
<thead>
<tr>
<th>Application</th>
<th>Input size</th>
</tr>
</thead>
<tbody>
<tr>
<td>WordCount</td>
<td>400 MB</td>
</tr>
<tr>
<td>Histogram</td>
<td>1.6 GB</td>
</tr>
<tr>
<td>LinearRegression</td>
<td>400 MB</td>
</tr>
<tr>
<td>MatrixMultiply</td>
<td>2048 x 2048 Matrices</td>
</tr>
</tbody>
</table>

Table 1: MapReduce application workloads
frequency of 800MHz. We use scckit 1.4.1.3 and each core runs Linux kernel version 2.6.38. We use version 4.5.2 of GCC and G++ compilers.

5.1. Message-Passing vs. Hybrid-Address-Spaces

We first compare DiMR (Section 3) to HyMR (Section 4), in terms of absolute performance. WordCount generates the largest number of distinct intermediate keys among the benchmarks, thus stressing the Combine, Partition and Merge phases of MapReduce. Figure 12 shows the breakdown of execution time of each benchmark with DiMR (left) and HyMR (right). For these results, we use 48 cores of the SCC. We note that in all cases, execution time is dominated by the Map stage. This indicates that both DiMR and HyMR have been heavily optimized to avoid bottlenecks during communication-intensive stages, such as partitioning and sorting [27]. The Map stages includes the Map and Combine phases in our implementation for both runtimes. With hybrid memory, we use work stealing and the HyMR’s optimized combiner. These two optimizations justify why the HyMR Map is faster than the DiMR Map. HyMR also uses a global address space in shared memory for the Partition stage. This allows the runtime system to use a hash table with open addressing to store intermediate data. This data structure enables the implementation of a more efficient combiner. In DiMR, the runtime system stores intermediate data as raw data and the processing of this data adds significant overhead.

The workload of tasks in the Map stage is not the same across tasks. Tasks exhibit variation in their execution time for different chunks of input data, thus load-balancing is necessary in a MapReduce runtime system. A shared address space enables an efficient implementation of interrupt-less load-balancing in HyMR using work-stealing and achieves more effective load balancing than the static data splitting.

The Partition stage is based on an all-to-all exchange, implemented with message passing in DiMR, but on shared memory and through LUT remapping in HyMR. Table 2 shows the speedup that shared memory all-to-all exchange achieves over message passing all-to-all exchange for all benchmarks, using 48 cores. These results illustrate that a cache-bypassing, all-to-all exchange in place in shared memory performs better in all cases. Benchmarks with many intermediate keys have larger performance gains. In MatrixMultiply, the only exception, none of the two runtimes executes the Partition stage.

HyMR and DiMR have identical implementations of the Reduce stage. In the Merge stage, DiMR uses the binomial merge algorithm whereas HyMR uses parallel sorting with regular sampling. Table 2 shows

<table>
<thead>
<tr>
<th>Application</th>
<th>Partition Speedup</th>
<th>Merge Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td>WordCount</td>
<td>6.64×</td>
<td>9.61×</td>
</tr>
<tr>
<td>Histogram</td>
<td>1.48×</td>
<td>0.69×</td>
</tr>
<tr>
<td>Linear Regression</td>
<td>1.28×</td>
<td>0.78×</td>
</tr>
<tr>
<td>Matrix Multiply</td>
<td>1.00×</td>
<td>1.00×</td>
</tr>
<tr>
<td>GeoMean</td>
<td>1.88×</td>
<td>1.50×</td>
</tr>
</tbody>
</table>

Table 2: Speedup for partition and merge stages computed using DiMR execution time over HyMR execution time using 48 cores.
the speedup that HyMR achieves over DiMR during the Merge stage, for all benchmarks using 48 cores. WordCount has the largest number of output keys and the performance gain is the most significant in comparison to other benchmarks. Histogram and LinearRegression indicate a small slowdown from using hybrid address spaces in the Merge stage. MatrixMultiply does not execute the Merge stage.

5.2. Scalability

Overall, HyMR consistently outperforms DiMR on the SCC. To compare HyMR with Phoenix++, we evaluate the latter on a 48-core cache-coherent multi-processor, with 4 AMD Opteron 6172 processors running at 2.1GHz and 64GB of DRAM. This system runs Linux version 2.6.32 and the 4.7.0 version of GCC and G++ compilers. Our comparison is not a direct one, as the SCC and AMD systems have fundamentally different processors, memory management units and communication substrates. While the cache-coherent AMD system would support distributed memory and hybrid address space implementations, these implementations would all be underpinned by the hardware coherence protocol, which would render message passing with direct core-to-core communication, as in the SCC, infeasible. Conversely, a shared memory implementation of the runtime system on SCC would require a software virtual memory coherence protocol, which is hard to scale on many cores. It is for these reasons that we compare MapReduce implementations on different platforms and use two metrics that partially neutralize the underlying architecture: scalability in terms of speedup and percentage of peak data processing bandwidth (bandwidth utilization) achieved by each implementation.

Figure 13 indicates that in all cases HyMR achieves almost linear speedup whereas Phoenix++ encounters scalability bottlenecks, usually at 32 cores. To calculate speedup we use execution time with four cores as a baseline. We multiply this value by four to predict the execution time on one core, assuming that benchmarks scale perfectly up to 4 cores (a hypothesis that is confirmed in reality). We cannot obtain reasonable direct execution times on one core as the datasets used are too big to fit in the memory accessible to any core in the system. In both HyMR and Phoenix++, the execution time dominated by the Map stage (Figure 12), which includes the Combine stage in both implementations. These stages are fully parallel, with no application data communication and low synchronization activity between cores. The authors in [15] evaluate Phoenix++ using an Intel machine consists of 4× Nehalem-EX processors with 4 NUMA nodes. We use 4× AMD Opteron 6172 processors, in a system with a more complicated NUMA design, which includes 8 NUMA nodes. Further experiments suggest that NUMA effects are more pronounced in AMD machines rather than in Intel machines. The results of Phoenix++ are sub-optimal due to inopportune data placement on NUMA nodes, despite that Phoenix++ is NUMA-aware by design. Another problem of Phoenix++ is false sharing, as an effect of data structure layout and the hardware-supported cache-coherence protocol. HyMR uses distributed memory during Map and Combine stages. This allows HyMR to solve the false sharing problem. The scalability gap between HyMR and Phoenix++ increases with the number of cores.
5.3. Sustained to Peak Bandwidth

As MapReduce fundamentally targets data-intensive applications, the data processing bandwidth of the MapReduce runtime system is a proper metric for evaluation. We compare the bandwidth that each benchmark achieves normalized to the peak data streaming bandwidth in each of our two platforms. In both cases we measure the peak bandwidth using the STREAM benchmark [42, 43] (Triad case). Figure 14 shows the peak bandwidth that each system achieves, as reported by the STREAM benchmark. AMD Opteron cores run in 2.1GHz and use 64GB DRAM clocked at 1333MHz, while and SCC cores in 800MHz and use 32GB DRAM clocked at 800MHz. AMD Opteron processors also have a significantly more efficient ALU than the outdated Pentium-class cores used on the SCC. These differences justify the gap in available memory bandwidth between the two architectures. Despite this difference, we note that available bandwidth scales well with the number of cores on the SCC but reaches a point of saturation at 32 cores on the AMD system.

We measure the bandwidth that each benchmark achieves with HyMR and Phoenix++. We normalize the measurements with the peak bandwidth of the platform on which each runtime executes. This is an efficiency metric with an ideal value of 1. Figure 15 shows that in WordCount, Histogram and LinearRegression the bandwidth efficiency of HyMR exceeds the efficiency of Phoenix++. Phoenix++ achieves higher bandwidth efficiency only in MatrixMultiply, where the required memory bandwidth is at any rate low, as the benchmark exhibits excellent locality. On average HyMR achieves 3.18× better bandwidth efficiency than Phoenix++ on 48 cores.

5.4. Discussion

We analyze the reasons behind the performance gap between HyMR and DiMR in this section. DiMR uses the Intel RCCE, a lightweight message passing library optimized for the SCC. This library provides
the basic primitives RCCE\textunderscore send, recv. In order to send a message the sender puts the message in the local MPB. After the necessary synchronization the receiver copies the data to its private L1 cache and then to DRAM. This results in large data transfers in the \textit{Partition} stage in MapReduce. By contrast, in HyMR, all cores access data from the shared DRAM and thus avoid unnecessary data copies. The \textit{Partition} stage is the main scalability bottleneck in DiMR. In order to move data between cores, data must be kept in raw buffers. This affects the performance and scalability of the \textit{Map} stage as well. HyMR does not execute unnecessary data copies and uses more efficiently accessible data structures to store intermediate data. Furthermore, MapReduce can be implemented so that each processor accesses only private data, which in turn negates the need for cache coherence. Using shared DRAM and private processor caches without maintaining cache coherence is ideal in this scenario. In Section 6 we examine a different scenario where processors must access both private and shared data.

Synchronization messages used in MapReduce are small in size and may also prevent scaling. DiMR uses on-chip shared memory for synchronization. Although faster than shared DRAM, on-chip shared memory has limited size. Each core has its own cache hierarchy where both application data and synchronization metadata is loaded. In applications with frequent synchronization operations, synchronization metadata invalidate and flush application data out of the cache with severe performance implications. Bypassing the cache hierarchy and using MPB buffers for synchronization metadata and communication of short messages is the ideal choice for runtime systems based on message passing. Conversely, not using MPB buffers for application data improves communication and synchronization performance. Selective invalidation or flushing of data in specific addresses in the cache might alleviate this problem. HyMR uses on-chip MPBs and message passing for barriers and task queues with work-stealing.

Finally, the runtime system must be aware of the 2D-mesh interconnect of the processor. If a processor accesses data through a non-local memory controller or when several cores are accessing the same data simultaneously through the same memory controller introduces, memory accesses suffer from significantly increased latency. The optimization shown in Figure 9 resolves this problem, by balancing accesses to shared memory and each memory controller in the \textit{Partition} stage.

6. HyRMA: Using Hybrid Address Spaces to Develop RMA Programming Models

In this section we show the effectiveness of hybrid address spaces in parallel applications using an RMA (Remote Memory Access) programming model. Our model implementation (HyRMA) allows the programmer to explicitly place data in shared cacheable DRAM or private non-cacheable scratch space. The model then uses one-way data transfers between any of these memory spaces to optimize communication paths. We use the Jabobi method as a use case to demonstrate this programming model.

Jacobi is an iterative algorithm which, given a set of boundary conditions, finds discretized solutions to differential equations of the form $\nabla^2 A + B = 0$. Each step of the algorithm replaces each node of a grid with
the average of the values of its nearest neighbors. To demonstrate the advantages of HyRMA we compare a version of Jacobi that uses message passing for communication with a HyRMA version that uses one-way transfers from global shared memory to on-chip caches and vice versa, as well as between on-chip MPBs.

6.1. Design & Implementation

Assume a Jacobi method for a two-dimensional $N \times N$ grid. To find a solution on the grid, the method repeatedly applies the following iterative step:

$$A_{i,j}^{k+1} = \frac{A_{i+1,j}^k + A_{i-1,j}^k + A_{i,j+1}^k + A_{i,j-1}^k}{4} \quad (4)$$

where $i,j$ are indices on the two-dimensional array and $k$ the iteration number. This step is applied until the method converges to a solution. For convergence testing, the method computes:

$$\text{diff} = \sqrt{\sum_{0 \leq i,j < N} (A_{i,j}^{k+1} - A_{i,j}^k) \times (A_{i,j}^{k+1} - A_{i,j}^k)} \quad (5)$$

and iterates until $\text{diff} \leq 0.01$. To parallelize this algorithm we divide the rows of the two-dimensional array by the number of available cores. Each core gets a $\left\lceil \frac{N \times N \times N}{\text{cores}} \right\rceil$ sub-array on which it can compute in parallel with other cores, at every iteration of the algorithm. Cores must exchange boundary data –upper and lower row– of their sub-arrays with their respective neighbors between iterations and check the error (convergence criterion), first locally and then cumulatively across all cores, to decide if more iterations are necessary for convergence.

6.1.1. Message Passing Implementation

In the message passing implementation of Jacobi we use RCCE $\{$send, recv$\}$ calls to exchange neighbor rows. To merge error values we use a customized implementation of MPI_AllReduce algorithm with RCCE primitives.

6.1.2. HyRMA

In the HyRMA implementation we store the array in cacheable shared DRAM, which accelerates the compute kernel of Jacobi. We distribute the data similarly to a partitioned shared address space approach. The data distribution in shared DRAM maximizes DRAM access locality from cores –equivalently, minimizes data transfer latency through the SCC on-chip interconnection network– and minimizes contention at memory controllers. However, using exclusively cacheable shared RAM implies that data exchange and synchronization between iterations would necessitate expensive cache flushing operations. We leverage hybrid address spaces to alleviate this problem, by allocating the boundary rows that cores must exchange in MPB buffers which are not cacheable in the L2 cache. We use direct one-way transfers to MPB buffers to implement the row exchanges between Jacobi iterations. We also use one-way transfers to MPBs to compute the cumulative error and check for convergence. We implement a based dissemination barrier (Section 4.2.1), also leveraging the MPBs.

6.2. Experimental Analysis

We compare the two implementations of Jacobi with message passing and HyRMA. Figure 16 shows the comparison of the two programming models. The left figure shows the execution time for 4 to 48 cores. The right figure shows speedup in the same range. The HyRMA implementation is faster than the message passing implementation and scales better as the number of cores increases. The main contributor to this difference is the reduction of communication latency via the use of on-chip one-way transfers for data exchanges and convergence checking. The optimal distribution of core-private data in the HyRMA version and the avoidance of costly data exchanges through the L2 caches and on-chip interconnect further contribute to the performance gap between HyRMA and message passing.
6.3. Discussion

The reason behind the low performance of the message passing implementation is redundant data copies. The RCCE library copies the data to the correct MPB buffer and then to DRAM for send/recv operations. These copies make data exchanging almost as expensive as computation, which in Jacobi is fully parallel. The impact of copying is more pronounced on 16 or more cores and affects speedup. HyRMA removes the need to copy data back to DRAM. The runtime system accesses data directly from MPBs to perform computation. The runtime system also bypasses the L2 caches and does not pollute them with useless data from the exchanges. Finally, accesses to MPB buffers are always performed to neighbor MPB buffers and thus minimize latency in the on-chip interconnect.

Despite the aforementioned optimizations HyRMA does not scale as well as HyMR. The reason is that accesses from MPB buffers bypass the L1 cache and incur additional latency. HyMR does not access application data from MPB buffers, which are used only for synchronization and load balancing. HyRMA necessitates cache bypassing for synchronization and data communication. Nevertheless, HyRMA still scales better than a message passing approach.

7. Related Work

Several prior research efforts ported MapReduce to prominent hardware platforms for high-performance computing, including cache-coherent multi-core processors [23, 24, 15, 25, 33] and non cache-coherent multi-core processors [26, 27].

Phoenix, a port of MapReduce for cache-coherent shared-memory multi-core systems [23, 24, 15], exploits locality implicitly by controlling the granularity of tasks and the assignment of tasks to cores. Phoenix performs dynamic assignment of map and reduce tasks to cores. It controls task sizes so that the working set of each task fits in the L1 cache of each core. Phoenix also provides an option to perform prefetching in the L2 data cache. The main focus in the design of Phoenix is on achieving scalability through NUMA-aware memory management. Each map thread emits intermediate results on a space allocated on the closest memory module to the CPU the thread is scheduled on.

In [24], the authors use a multi-layer approach to optimize the runtime system. These layers include the algorithm, the implementation and the runtime-OS interaction. In the most recently published version of Phoenix [15] the authors provides a modular, flexible pipeline that can be easily adapted by the user to the characteristics of a particular workload while allowing users to write simple, strict MapReduce code. In [33] the authors explore the design of the MapReduce data structures for grouping intermediate <key,value> pairs. A different approach to optimize Phoenix is proposed in [25] where the authors use ”tiling strategy” to minimize task memory footprints and improve cache locality. HyMR differs from Phoenix in that it leverages both distributed and shared address spaces on-demand, to improve scalability. However, the design and implementation of HyMR do not prevent the horizontal (cache-level) or vertical (NUMA DRAM-level) locality optimizations implemented in Phoenix++.

High-performance implementations of MapReduce have also been available on systems with distributed address spaces, most notably the Cell BE processor [26, 27]. In these implementations, the runtime system controls locality explicitly, using DMAs and software prefetching via multi-buffering in the map, merge and sort stages. Contrary to Phoenix, the runtime system neither hashes nor partitions keys in per-core buffers, thereby eliminating memory copies, while allowing a balanced distribution of work during the sort and reduce stages. HyMR, contrary to the prior implementations of MapReduce on Cell, leverages both distributed and shared address spaces. The use of a shared address space with cache bypassing in HyMR enables more efficient exchanges of large volumes of data between cores.

Recently, implementations of MapReduce using Partitioned Global Address Space Languages, such as X10 [44], and Unified Parallel C [45], have demonstrated superior performance to prior implementations based on Hadoop (for distributed address space systems) and Phoenix++ (for shared address space system). These implementations use a virtualized shared address space, thus missing opportunities to leverage maximally on-chip communication for latency-critical MapReduce operations. Furthermore, they delegate the control of scheduling and data transfers to the underlying language runtime system, which provides
generic, rather than MapReduce-specific memory coherence and consistency mechanisms, thus introducing additional performance inefficiencies.

Prior research has synthesized memory models and programming models to achieve more efficient parallelization on cluster architectures with SMP or multi-core nodes [46, 47, 48]. While these prior propositions provide abstractions of private and shared address spaces, they implement those address spaces on top of a common hardware substrate and do not customize the communication path for any given address space. Furthermore, the split address spaces in prior research are explicitly managed by programmers, a burden which we avoid in our work by implementing programming models that present a global address space abstraction to programmers but implement this abstraction using multiple physical address spaces and custom communication paths.

8. Conclusions

This paper presented a design and implementation of MapReduce using hybrid address spaces. Future and emerging many-core processors, such as Intel’s Runnemedee [5], will provide communication pathways through distributed address spaces or shared address spaces, both on-chip and off-chip. The idea elaborated in this work is to use distributed address spaces in runtime system stages where cores share no application data and need to exchange only control messages for the purposes of scheduling and load balancing. The absence of a hardware cache coherence protocol allows runtime systems to scale almost perfectly in shared-nothing stages. On the contrary, runtime stages where cores exchange significant volumes of application data are best implemented in an off-chip shared address space. Where data is streamed and there is no opportunity for data reuse, bypassing caches is the most performant implementation option. This paper further argues that in staged runtime systems, an application-specific implementation of memory coherence is scalable and performant. In MapReduce specifically, the Map and Reduce stages are embarrassingly parallel and running them over a hardware or software cache coherence protocol results in a consistent performance hit. We have also implemented an RMA programming model using hybrid address spaces and used a stencil code to prove its superiority to a message passing model.

Acknowledgments

The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] under the NanoStreams project, grant agreement n° 610509 and the I-CORES project, grant agreement n° 224759.

References

Appendix A. PSRS Algorithm

<table>
<thead>
<tr>
<th>Sizes</th>
<th>32cores ref</th>
<th>32cores new</th>
<th>48cores ref</th>
<th>48cores new</th>
</tr>
</thead>
<tbody>
<tr>
<td>4M</td>
<td>26.49</td>
<td>23.66</td>
<td>23.56</td>
<td>28.97</td>
</tr>
<tr>
<td>8M</td>
<td>27.29</td>
<td>30.56</td>
<td>30.33</td>
<td>40.12</td>
</tr>
</tbody>
</table>

Table A.3: Speedup of new PSRS algorithm (new) and original PSRS algorithm (ref).

```c
struct sub_array {
    size_t begin;
    size_t length;
};

/* allocate on-chip shared memory of size bytes in id’s MPB buffer */
void *mpballoc(int id, size_t size);

/* allocate off-chip shared memory of size bytes */
void *shmalloc(size_t size);

/* an PSRS algorithm to sort arrays of integer */
int *PSRS(int **array, size_t *array_size, int id, int num_cores)
{
    int p = num_cores;
    int p_1 = num_cores - 1;
    int pp_1 = num_cores * (num_cores - 1);
    size_t total_size = 0;
    for (i = 0; i < p; i++)
        total_size += array_size[i];
    qsort(array[id], array_size[id]); /* sort the local partition */
    cache_flush(); /* L2 cache flush */
    barrier();

    /* choose the samples */
    int sample[pp_1];
    for (i = 0; i < p; i++)
    {
        rsize = (array_size[i] + p_1)/p;
        for (j = 0; j < p_1; j++)
            sample[i*p_1+j] = array[i][(j+1)*rsize];
    }
    qsort(sample, pp_1);

    /* choose the pivots */
    int pivots[p_1];
    for (i = 0; i < p_1; i++)
        pivots[i] = sample[i*p + p/2];

    struct sub_array *sa = mpballoc(id, p*sizeof(struct sub_array));
    sublists(array[id], array_size[id], sa, pivots, p_1); /* same algorithm as original PSRS paper */
    wcb_flush(); /* Write-Combine-Buffer flush */
    barrier();

    struct sub_array l_index[p];
    for (i = 0; i < p; i++)
```
size_t count = 0;
for (i = 0; i < p; i++)
    count += l_index[i].length;
out_size[id] = count;
wcb_flush(); /* Write-Combine-Buffer flush */
barrier();
CLIINVMB(); /* invalidate MPB entries stored in L1 cache */
size_t out_begin = 0;
for (i = 0; i < id; i++)
    out_begin += out_size[i];
int *output_array = shmalloc(n,sizeof(int));
int *out_arr = &(output_array[out_begin]);
size_t size_arr[p];

heap_merge(data_arr, size_arr, p, out_arr); /* using a heap based merge algorithm */

barrier();

return the output array stored in shared memory */
return output_array;