The Twentieth International Conference on
PACT 2011 Accepted Papers
The following papers have been accepted for presentation at PACT 2011. The schedule will be posted as soon as it is available.
Authors: Kevin J. Brown, Arvind K. Sujeeth, and HyoukJoong Lee (Stanford University), Tiark Rompf (EPFL), and Hassan Chafi and Kunle Olukotun (Stanford University)
Abstract: Computing systems are becoming increasingly parallel and heterogeneous, and therefore new applications must be capable of exploiting parallelism in order to continue achieving high performance. However targeting these emerging devices often requires using multiple disparate programming models and making decisions that can limit forward scalability. We propose the use of domain-specific languages (DSLs) to provide high-level abstractions that enable transformations to high performance parallel code without degrading programmer productivity. We present the Delite Compiler Framework and Runtime environment, an end-to-end system for executing DSL applications on parallel heterogeneous hardware. The framework lifts embedded DSL applications to an intermediate representation (IR), performs general- purpose, parallel, and domain-specific optimizations, and generates an execution graph that targets multiple heterogeneous hardware devices. Finally we present results comparing the performance of several machine learning applications written in OptiML, a DSL for machine learning which utilizes Delite, to C++ and MATLAB implementations and find that the implicitly parallel OptiML applications achieve single-threaded performance comparable to C++ and outperforms explicitly parallel MATLAB in nearly all cases.
A Unified Scheduler for Recursive and Task Dataflow Parallelism
Authors: Hans Vandierendonck (UGent, FORTH-ICS) and George Tzenakis and Dimitrios S. Nikolopoulos (FORTH-ICS)
Abstract: Task-based languages simplify the specification of high-performant parallel programs by dynamically detecting and enforcing dependencies between tasks. These languages are, however, often restricted to a single level of parallelism. This language design is reflected in the runtime system, where a master thread explicitly generates a task graph and worker threads execute ready tasks and wake- up their dependents. Such an approach is incompatible with state-of-the-art schedulers such as the Cilk scheduler, that minimize the creation of idle tasks (work-first principle) and places all task creation and scheduling off the critical path. This paper proposes an extension to the Cilk scheduler in order to reconcile task dependencies with the work-first principle. We discuss the impact of task dependencies on the properties of the Cilk scheduler. Furthermore, we propose a low-overhead ticket-based technique for dependency tracking and enforcement at the object level. Our scheduler also supports renaming of objects in order to increase task-level parallelism. Renaming is implemented using versioned objects, a new type of hyperobject. Experimental evaluation shows that the new scheduler is as efficient as the Cilk scheduler when tasks have no dependencies. Moreover, the new scheduler is as efficient as SMPSS, a particular implementation of a task-based language.
An Evaluation of Vectorizing Compilers
Authors: Saeed Maleki, David Padua, and Maria J. Garzaran (University of Illinois at Urbana-Champaign) and Yaoqing Gao and Tommy Wong (IBM Toronto Laboratory)
Abstract: Most of todays processors come with vector units that have been designed to speed up single threaded programs. Although vector instructions can deliver high performance, writing vector code in assembly or using intrinsic is a very time consuming and error-prone task. Fortunately, vectorizing compilers can generate vector codes for different platforms. But, unfortunately, they do not do a perfect job. This paper evaluates how powerful the vectorizing compilers are in generating vector code and it is based on a synthetic benchmark, two application from Petascale Application Collaboration Teams (PACT) and eight applications from Media Bench II. Our results show that despite all the work done in vectorization in the last 40 years less than 60% of the loops in the synthetic benchmark and only a few loops from the real applications are vectorized by the compilers we evaluated. It also provides an analysis for the issues that prevent them from vectorization, and the transformations that the programmer needs to apply to enable automatic vectorization.
An OpenCL Framework for Homogeneous Manycores with No Hardware Cache Coherence
Authors: Jun Lee, Jungwon Kim, Junghyun Kim, Sangmin Seo, and Jaejin Lee (Seoul National University)
Abstract: Recently, Intel has introduced a research prototype manycore processor named single-chip cloud computer (SCC). The SCC contains 48 cores in a single chip and each core has its own L1 and L2 caches without any hardware support for cache coherence. It allows maximum 64GB size of external memory that can be accessed by all cores and each core dynamically maps the external memory into their own address space. In this paper, we introduce the design and implementation of an OpenCL framework (i.e., runtime and compiler) for such manycore architectures with no hardware cache coherence. We have found that the OpenCL coherence and consistency model fits well with the SCC architecture. The OpenCLÕs weak memory consistency model requires relatively small amount of messages and coherence actions to guarantee coherence and consistency between the memory blocks in the SCC. The dynamic memory mapping mechanism enables our framework to preserve the semantics of the buffer object operations in OpenCL with a small overhead. We have implemented the proposed OpenCL runtime and compiler and evaluate their performance on the SCC with OpenCL applications.
ARIADNE: Agnostic Reconfiguration In A Disconnected Network Environment
Authors: Konstantinos Aisopos (Princeton University), Andrew DeOrio (University of Michigan), Li-Shiuan Peh (Massachusetts Institute of Technology), and Valeria Bertacco (University of Michigan)
Abstract: Extreme transistor technology scaling is causing increasing concerns in device reliability: the expected lifetime of individual transistors in complex chips is quickly decreasing, and the problem is expected to worsen at future technology nodes. With complex designs increasingly relying on Networks-on-Chip (NoCs) for on-chip data transfers, a NoC subsystem must continue to operate even in the face of many transistor failures. Specifically, it must be able to reconfigure and reroute packets around faults to enable continued operation, i.e., generate new routing paths to replace the old ones upon a failure. In addition to these reliability requirements, NoCs must continue to maintain low latency and high throughput during normal operation, at very low area overhead. In this work, we propose a distributed reconfiguration solution named Ariadne, targeting large, aggressively scaled, unreliable NoCs. Ariadne utilizes up*/down* for fast routing at high bandwidth, and upon any number of concurrent network failures in any location, it reconfigures to discover new resilient paths to connect the surviving nodes. Experimental results show that Ariadne provides a 40%-140% latency improvement (when subject to 50 faults in a 64-node NoC) over other on-chip state-of-the-art fault tolerant solutions, while meeting the low area budget of on-chip routers with an overhead of just 1.97%.
Coherent Profiles: Enabling Efficient Reuse Distance Analysis of Multicore Scaling for Loop-based Parallel Programs
Authors: Meng-Ju Wu and Donald Yeung (Department of Electrical and Computer Engineering, University of Maryland at College Park)
Abstract: Reuse distance (RD) analysis is a powerful memory analysis tool that can potentially help architects study multicore processor scaling. One key obstacle though is multicore RD analysis requires measuring concurrent reuse distance (CRD) profiles across thread-interleaved memory reference streams. Sensitivity to memory interleaving makes CRD profiles architecture dependent, preventing them from analyzing different processor configurations. For loop-based parallel programs, CRD profiles shift coherently to larger CRD values with core count scaling because interleaving threads are symmetric. Simple techniques can predict such shifting, making the analysis of numerous multicore configurations from a small set of CRD profiles feasible. Given the ubiquity and scalability of loop-level parallelism, such techniques will be extremely valuable for studying future large multicore designs. This paper investigates using RD analysis to efficiently analyze multicore cache performance for loop-based parallel programs, making several contributions. First, we provide in-depth analysis on how CRD profiles change with core count scaling. Second, we develop techniques to predict CRD profile scaling, in particular employing reference groups to predict coherent shift, and evaluate prediction accuracy. Third, we show core count scaling only degrades performance for last level caches (LLCs) below 16MB for our benchmarks and problem sizes, increasing to 64-128MB if problem size scales by 64x. Finally, we apply CRD profiles to analyze multicore cache performance. When combined with existing problem scaling prediction, our techniques can predict LLC MPKI to within 11.1% of simulation across 1,728 configurations using only 36 measured CRD profiles.
Compiling Dynamic Data Structures in Python to Enable the Use of Multi-core and Many-core Libraries
Authors: Bin Ren and Gagan Agrawal (The Ohio State University)
Abstract: Programmer productivity considerations are increasing the popularity of interpreted languages like Python. At the same time, for applications where performance is important, these languages clearly lack even on uniprocessors. In addition, the use of dynamic data structures in a language like Python makes it very hard to use emerging libraries for enabling the execution on multi-core and many- core architectures. This paper presents a framework for compiling Python to use multi-core and many-core libraries. The key component of our framework involves a suite of algorithms for replacing dynamic and/or nested data structures by arrays, while minimizing unnecessary data copying costs. This involves a novel use of an existing partial redundacy elimination algorithm, development of a new demand- driven interprocedural partial redundacy algorithm, a data flow formulation for determining that the contents of the data structure are of the same type, and a linearization algorithm. We have evaluated our framework using data mining and two linear algebra applications written in pure Python. The key observations were: 1) the code generated by our framework is only 10% to 20% slower compared to the hand-written C code that invokes the same libraries, 2) our optimizations turn out to be significant for improving the performance in most cases, and 3) we outperform interpreted Python and the C++ code generated by an existing tool by one to two orders of magnitude.
Correctly Treating Synchronizations in Compiling Fine-Grained SPMD-Threaded Programs for CPU
Authors: Ziyu Guo, Eddy Z. Zhang, and Xipeng Shen (The College of William and Mary)
Abstract: Automatic compilation for multiple types of devices is important, especially given the current trends towards heterogeneous computing. This paper concentrates on some issues in compiling fine- grained SPMD-threaded code (e.g., GPU CUDA code) for multicore CPUs. It points out some correctness pitfalls in existing techniques, particularly in their treatment to implicit synchronizations. It then describes a systematic dependence analysis specially designed for handling implicit synchronizations in SPMD-threaded programs. By unveiling the relations between inter-thread data dependences and correct treatment to synchronizations, it presents a dependence-based solution to the problem. Experiments demonstrate that the proposed techniques can resolve the correctness issues in existing compilation techniques, and help compilers produce correct and efficient translation results.
DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory
Authors: Carlos Villavieja (UPC,BSC), Vasilis Karakostas (BSC), Lluis Vilanova (UPC,BSC), Yoav Etsion (BSC), Alex Ramirez (UPC, BSC), Avi Mendelson (Microsoft), Nacho Navarro (UPC, BSC) Adrian Cristal (BSC) and Osman Unsal (BSC)
Abstract: Translation Lookaside Buffers} (TLBs) are ubiquitously used in modern architectures to cache virtual-to-physical mappings and, as they are looked up on every memory access, are paramount to performance scalability. The emergence of chip-multiprocessors (CMPs) with per-core TLBs, has brought the problem of TLB coherence to front stage. TLBs are kept coherent at the software-level by the operating system (OS). Whenever the OS modifies page permissions in a page table, it must initiate a coherency transaction among TLBs, a process known as a TLB shootdown. Current CMPs rely on the OS to approximate the set of TLBs caching a mapping and synchronize TLBs using costly Inter-Proceessor Interrupts (IPIs) and software handlers. In this paper, we characterize the impact TLB shootdowns on multiprocessor performance and scalability, and present the design of a scalable TLB coherency mechanism. First, we show that both TLB shootdown cost and frequency increase with the number of processors and project that software-based TLB shootdowns would thwart the performance of large multiprocessors. We then present a scalable architectural mechanism that couples a shared TLB directory with load/store queue support for lightweight TLB invalidation, and thereby eliminates the need for costly IPIs. Finally, we show that the proposed mechanism reduces the fraction of machine cycles wasted on TLB shootdowns by an order of magnitude.
Divergence Analysis and Optimizations
Authors: Bruno Rocha Coutinho, Diogo Nunes Sampaio, Fernando Magno Quintao Pereira, and Wagner Meira Jr. (UFMG)
Abstract: The growing interest in GPU programming has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers a tremendous computational power; however, the model also brings restrictions. In particular, processing elements (PEs) execute in lock-step, and may lose performance due to divergences caused by conditional branches. In face of divergences, some PEs execute, while others wait; this alternation ending when they reach a synchronization point. In this paper we introduce divergence analysis, a static analysis that determines which program variables will have the same values for every PE. This analysis is useful in three different ways: it improves the translation of SIMD code to non-SIMD CPUs, it helps developers to manually improve their SIMD applications, and it also guides the compiler in the optimization of SIMD programs. We demonstrate this last point by introducing branch fusion, a new compiler optimization that identifies, via a gene sequencing algorithm, chains of similarities between divergent program paths, and weaves these paths together as much as possible. Our implementation has been accepted in the Ocelot open-source CUDA compiler, and is publicly available. We have tested it on many industrial-strength GPU benchmarks, including Rodinia and the Nvidia's SDK. Our divergence analysis has a 34% false- positive rate, compared to the results of a dynamic profiler. Our automatic optimization adds 4% speed-ups onto these already heavily optimized benchmarks. Our manual optimizations extend these numbers to speed-ups of over 10%.
Dynamic Fine-Grain Scheduling of Pipeline Parallelism
Authors: Daniel Sanchez (Stanford), David Lo (Stanford), Richard M. Yoo (Stanford), Jeremy Sugerman (Stanford), and Christos Kozyrakis (Stanford)
Abstract: Scheduling pipeline-parallel programs, defined as a graph of stages that communicate explicitly through queues, is challenging. When the application is regular and the underlying architecture can guarantee predictable execution times, several techniques exist to compute highly optimized static schedules. However, these schedules do not admit run-time load-balancing, so run-time variability introduced by the application or the underlying hardware causes load imbalance, hindering performance. On the other hand, existing schemes for dynamic fine-grain load balancing (such as task-stealing) do not work well on pipeline-parallel programs: they cannot guarantee memory footprint bounds, and do not adequately schedule complex graphs or graphs with ordered queues. We present a scheduler implementation for pipeline-parallel programs that performs fine-grain dynamic load balancing efficiently. Specifically, we implement the first real runtime for GRAMPS, a recently proposed programming model that focuses on supporting irregular pipeline and data parallel applications (in contrast to classical stream programming models and schedulers, which require programs to be regular). Task-stealing with per- stage queues and queuing policies, coupled with a backpressure mechanism, allow us to maintain strict footprint bounds, and a buffer management scheme based on packet-stealing allows low-overhead and locality-aware dynamic allocation of queue data. We evaluate our runtime on a multi-core SMP and find that it provides low-overhead scheduling of irregular workloads while maintaining locality. We also show that a GRAMPS scheduler outperforms several other commonly used scheduling approaches. Specifically, while a typical task-stealing scheduler performs on par with GRAMPS on simple graphs, it does significantly worse on complex ones; a canonical GPGPU scheduler cannot exploit pipeline parallelism and suffers from large memory footprints; and a typical static, streaming scheduler achieves somewhat better locality, but suffers significant load imbalance on a general-purpose multicore due to fine-grain architecture variability (e.g., cache misses and Hyper-Threading).
Efficient Parallel Graph Exploration for Multi-Core CPU and GPU
Authors: Sunpgack Hong (Stanford), Tayo Oguntebi (Stanford), and Kunle Olukotun (Stanford)
Abstract: Graphs are fundamental data representation that has been used extensively in various domains. In graph-based applications, a systematic exploration of the graph such as a breadth-first search (BFS) often serves as a key component in the processing of their massive data sets. In this paper, we present a new method for implementing the parallel BFS algorithm on multi-core CPU targets. By utilizing memory bandwidth more efficiently, our method shows improved performance over the current state-of-the-art implementation and increases its advantage as the size of the graph increases. We then propose a hybrid method which, for each level of the BFS algorithm, dynamically chooses the best implementation from: a sequential execution, two different methods of multi-core execution, and a GPU execution. Such a hybrid approach provides the best performance for each graph size while avoiding poor worst-case performance per level. Finally, we study the effects of the underlying architecture on BFS performance by comparing multiple CPU and GPU systems; a high-end GPU system performed as well as a quad-socket highend CPU system.
Enhancing Data Locality for Dynamic Simulations through Asynchronous Data Transformations and Adaptive Control
Authors: Bo Wu, Eddy Z. Zhang, and Xipeng Shen (The College of William and Mary)
Abstract: Many dynamic simulation programs contain complex, irregular memory reference patterns, and require runtime optimizations to enhance data locality. Current approaches periodically stop the execution of an application to reorder the computation or data based on the current program state to improve the data locality for the next period of execution. In this work, we examine the implications that modern heterogeneous Chip Multiprocessors (CMP) architecture imposes on the optimization paradigm. We develop three techniques to enhance the optimizations. The first is asynchronous data transformation, which moves data reordering off the critical path through dependence circumvention. The second is a novel data transformation algorithm, named TLayout, designed specially to take advantage of modern throughput-oriented processors. Together they provide two complementary ways to attack a benefit-overhead dilemma inherited in traditional techniques. Working with a dynamic adaptation scheme, the techniques produce significant performance improvement for a set of dynamic simulation benchmarks.
Exploiting Task Order Information for Optimizing Sequentially Consistent Java Programs
Authors: Christoph M. Angerer and Thomas R. Gross (ETH Zurich)
Abstract: Java was designed as a secure language that supports running untrusted code as part of trusted applications. For safety reasons, Java therefore defines a memory model that prevents undefined behavior in multi-threaded programs even if the programs are not correctly synchronized. Because of the potential negative performance impact the Java designers did not choose a simple and natural memory model, such as sequential consistency, but instead developed a relaxed memory model that gives the compiler more optimization opportunities. As it is today, however, the relaxed Java Memory Model is not only hard to understand but it unnecessarily complicates reasoning about parallel programs and it turned out to be difficult to implement correctly. This paper presents an optimizing compiler for a Java version that has sequential consistency as its memory model. Based on a programming model with explicit happens-before constraints between tasks, we describe a static schedule analysis that computes whether two tasks may be executed in parallel or if they are ordered. During optimization, the task-ordering information is exploited to reduce the number of volatile memory accesses the compiler must insert to guarantee sequential consistency. The evaluation shows that scheduling information significantly improves the effectiveness of the optimizations. For our set of multi-threaded benchmarks the fully optimizing compiler removes between 70% and 100% of the volatile memory accesses inserted by the non-optimizing compiler. As a result, the overhead of sequentially consistent Java compared to standard Java is reduced from 136% on average for the unoptimized version to 11% on average for the optimized version. The results indicate that with appropriate optimizations, sequential consistency can be a feasible alternative to the Java Memory Model.
Improving Throughput of Power-Constrained GPUs Using Dynamic Voltage/Frequency and Core Scaling
Authors: Jungseob Lee, Vijay Satish, and Katherine Compton (Univ. of Wisconsin-Madison), Mike Schulte (AMD), and Nam Kim (Univ. of Wisconsin-Madison)
Abstract: State-of-the-art graphic processing units (GPUs) can offer very high computational throughput for highly parallel applications using hundreds of available cores. In general, the peak throughput of a GPU is proportional to the number of integrated cores, but the number of cores running simultaneously is often limited by the maximum power consumption allowed. According to our analysis, the number of operating cores and the frequencies of cores and on-chip interconnects/caches impact the throughput substantially depending on the characteristics of applications. For example, the throughput of some applications can be increased with more cores while that of other applications cannot due to the limited (i) parallelism of applications and/or (ii) bandwidth of on-chip interconnects/caches and off-chip memory. In this paper, we first examine how the number of operating cores and their voltage/frequency impact the throughput of applications under a power constraint. Second, on-chip interconnects/caches also consume a notable fraction of total GPU power. Thus, we analyze how the number of operating cores and the voltages/frequencies of cores and on-chip interconnects/caches affect the throughput of applications for a given power constraint. Finally, we propose a dynamic scaling algorithm that can determine an appropriate number of operating cores and their voltage/frequency to improve the overall throughput based on an application's runtime behavior. Our experimental results show that a GPU using the runtime dynamic scaling algorithm can provide 15% higher throughput than the baseline GPU for a given power constraint.
Large Scale Verification of MPI Programs Using Lamport Clocks with Lazy Update
Authors: Anh Vo, Ganesh Gopalakrishnan, and Robert M. Kirby (University of Utah) and Bronis R. de Supinski, Martin Schulz, and Greg Bronevetsky (Lawrence Livermore National Laboratory)
Abstract: We propose a dynamic verification approach for large-scale message passing programs to locate correctness bugs caused by unforeseen nondeterministic interactions. This approach hinges on an efficient protocol to track the causality between nondeterministic message receive operations and potentially matching send operations. We show that causality tracking protocols that rely solely on logical clocks fail to capture all nuances of MPI program behavior, including the variety of ways in which nonblocking calls can complete. Our approach is hinged on formally defining the matches-before relation underlying the MPI standard, and devising lazy update logical clock based algorithms that can correctly discover all potential outcomes of nondeterministic receives in practice. In particular, our far cheaper Lazy Lamport Clocks Protocol (LLCP) can achieve the same coverage as a vector clock based algorithm while maintaining good scalability. LLCP allows us to analyze realistic MPI programs involving a thousand MPI processes, incurring only modest overheads in terms of communication bandwidth, latency, and memory consumption.
Linear-time Modeling of Program Working Set in Shared Cache
Authors: Xiaoya Xiang, Bin Bao, and Chen Ding (University of Rochester) and Yaoqing Gao (IBM Toronto)
Abstract: Many techniques are based on the notion of the working set, which is the volume of data accessed in a time window. A complete characterization requires measuring data access in all O(n^2) windows in an n-element trace. Two recent techniques have significantly reduced the measurement cost, but the cost is still too high for realsize workloads. The working set is known as the program footprint. Instead of measuring all footprint sizes, this paper presents a technique for measuring the average footprint size. By confining the analysis to the average rather than the full range, the problem can be solved accurately by a linear-time algorithm. The paper presents the algorithm and evaluates it using the complete sets of 26 SPEC2000 and 29 SPEC2006 benchmarks. The new algorithm is compared against the previously fastest algorithm in both the speed of the measurement and the accuracy of shared-cache performance prediction.
Making STMs Cache Friendly with Compiler Transformations
Authors: Sandya Mannarswamy (Indian Institute of Science, Bangalore & Hewlett Packard) and Prof Ramaswamy Govindarajan (Indian Institute of Science, Bangalore)
Abstract: Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs. In order for STMs to be adopted widely for performance critical software, understanding and improving the cache performance of applications running on STM becomes increasingly crucial, as the performance gap between processor and memory continues to grow. In this paper, we present the most detailed experimental evaluation to date, of the cache behavior of STM applications and quantify the impact of the different STM factors on the cache misses experienced by the applications. We find that STMs are not cache friendly, with the data cache stall cycles contributing to more than 50% of the execution cycles in a majority of the benchmarks. We find that on an average, misses occurring inside the STM account for 62% of total data cache miss latency cycles experienced by the applications and the cache performance is impacted adversely due to certain inherent characteristics of the STM itself. The above observations motivate us to propose a set of specific compiler transformations targeted at making the STMs cache friendly. We find that STM's fine grained and application unaware locking is a major contributor to its poor cache behavior. Hence we propose selective Lock Data co-location (LDC) and Redundant Lock Access Removal (RLAR) to address the lock access misses. We find that even transactions that are completely disjoint access parallel, suffer from costly coherence misses caused by the centralized global time stamp updates and hence we propose the Selective Per-Partition Time Stamp (SPTS) transformation to address this. We show that our transformations are effective in improving the cache behavior of STM applications by reducing the data cache miss latency by 20.15% to 37.14% and improving execution time by 18.32% to 33.12% in five of the 8 STAMP applications.
Memory Architecture for Integrating Emerging Memory Technologies
Authors: Kun Fang (University of Illinois at Chicago), Long Chen and Zhao Zhang (Iowa State University), and Zhichun Zhu (University of Illinois at Chicago)
Abstract: Current main memory system design is severely limited by the decades-old synchronous DRAM architecture, which requires the memory controller to track the internal status of memory devices (chips) and schedule the timing of device operations. This rigidity has become an obstacle to integrating emerging memory technologies such as PCM into existing memory systems, because their timing requirements are vastly different. Furthermore, with the trend of embedding memory controllers into processors, it is crucial to have interoperability among general-purpose processors and diverse memory modules. To address this issue, we propose a new memory architecture framework called universal memory architecture (UniMA). It enables the interoperability by decoupling the scheduling of device operations from memory controller, using a bridge chip at each memory module to perform local scheduling. The new architecture may also help improve memory scalability, power efficiency, and bandwidth as previously proposed decoupled memory organizations. A major focus of this study is to evaluate the performance impact of local scheduling of device operations. We present a prototype implementation of UniMA on top of DDRx memory bus, and then evaluate its efficiency with different workloads. The simulation results show that UniMA actually improves memory system efficienc for memory-intensive workloads due to slightly increased parallelism among memory modules. The overall performance improvement over the conventional DDRx memory architecture is 3.1% on average. The performance of other workloads is reduced slightly, by 1.0% on average, due to a small increase of memory idle latency. In short, the prototype and evaluation demonstrate that it is possible to integrate diverse memory technologies into a single memory architecture with virtually no loss of overall performance.
Modeling and Performance Evaluation of TSO-Preserving Binary Optimization
Authors: Cheng Wang and Youfeng Wu (Intel Labs)
Abstract: Binary optimizations for programs running on multi-core systems must consider the processor memory consistency model for correctness and performance. This paper studies the valid binary optimizations in the popularly deployed Total Store Ordering (TSO) memory model. We first introduce a novel approach to formally model general valid binary optimizations in the formal TSO model. Then we provide a sound and complete algorithm to verify optimizations in the model. To evaluate the performance potential of valid optimizations in TSO and the benefits of HW supports on relaxing optimization constraints in TSO, we developed a binary optimization system and classified a set of well-known binary optimizations into three categories: 1) optimizations that are valid independent of memory model; 2) optimizations that are valid in TSO model; and 3) optimizations that are valid in TSO only with additional HW supports. We show in our experiments that, dynamic binary optimizations that are valid independent of memory models can improve binary program performance by 8.1%. Since TSO is `weaker' than other memory models, such as sequential consistency, valid memory optimizations in TSO can further improve the performance by 4.8% to a total 12.9%. HW supports can relax the optimization constraints imposed by TSO model and improve the overall performance to 20.4%.
No More Backstabbing... A Faithful Scheduling Policy for Multithreaded Programs
Authors: Kishore Kumar Pusukuri (UC Riverside), Rajiv Gupta (UC Riverside), and Laxmi N. Bhuyan (UC Riverside)
Abstract: Efficient contention management is the key for achieving scalable performance for multithreaded applications running on multicore systems. However, contention management policies provided by modern operating systems increase context switches and lead to performance degradation for multithreaded applications under high loads. Moreover, this problem is exacerbated with the interaction between contention management policies and OS scheduling polices. Time Share (TS) is the default scheduling policy in a modern OS such as Solaris 10 and Linux and with TS policy, priorities of threads change very frequently for balancing load and providing fairness in scheduling. Due to this frequent ping-ponging of priorities, threads of an application are often preempted by the threads of the same application, which increases the frequency of involuntary context-switches, lock-holder thread preemptions, and leads to poor performance. This problem becomes very serious under high loads. To alleviate this problem, in this paper, we present a scheduling policy called faithful scheduling (FF), which dramatically reduces context-switches, lock-holder thread preemptions, and leads to high performance. We implemented FF on a 24-core Dell PowerEdge R905 server running Solaris 10 and evaluated it using 22 programs including the TATP database application, SPECjbb2005, programs from PARSEC, SPEC OMP, and some microbenchmarks. The experimental results show that FF policy achieves high performance for both lightly and heavily loaded systems, and moreover it does not require any changes in the application source code or the OS kernel.
OpenMDSP: Extending OpenMP to Program Multi-Core DSP
Authors: Jiangzhou He and Wenguang Chen (Tsinghua University), Guangri Chen (Huawei Technologies Co. Ltd.), Weimin Zheng and Zhizhong Tang (Tsinghua University), and Handong Ye (Huawei Technologies Co. Ltd.)
Abstract: Multi-core Digital Signal Processors (DSP) are widely used in wireless telecommunication, core network transcoding, industrial control, and audio/video processing etc. Comparing with general purpose multi-processors, the multi-core DSPs normally have more complex memory hierarchy, such as on-chip core-local memory and non-cache-coherent shared memory. As a result, it is very challenging to write efficient multi-core DSP applications. The current approach to program multi-core DSPs is based on proprietary vendor SDKs, which only provides low-level, non-portable primitives. While it is acceptable to write coarse-grained task level parallel code with these SDKs, it is very tedious and error prone to write fine-grained data parallel code with them. We believe it is desired to have a high-level and portable parallel programming model for multi-core DSPs. In this paper, we propose OpenMDSP, an extension of OpenMP designed for multi-core DSPs. The goal of OpenMDSP is to fill the gap between OpenMP memory model and the memory hierarchy of multi-core DSPs. We propose three class of directives in OpenMDSP: (1) data placement directives allow programmers to control the placement of global variables conveniently; (2) distributed array directives divide whole array into sections and promote them into core-local memory to improve performance, and (3) stream access directives promote big array into core-local memory section by section during a parallel loop's processing. We implement the compiler and runtime system for OpenMDSP on FreeScale MSC8156. Benchmarking result shows that seven out of nine benchmarks achieve a speedup of more than 5 with 6 threads. Keywords: OpenMP, multi-core DSP, data parallelism, LTE
Optimizing Data Layouts for Parallel Computation on Multicores
Authors: Yuanrui Zhang, Wei Ding, Jun Liu, and Mahmut Kandemir (Penn State University)
Abstract: The emergence of multicore platforms offers several opportunities for boosting application performance. These opportunities, which include parallelism and data locality benefits, require strong support from compilers as well as operating systems. Current compiler research targeting multicores mostly focuses on code restructuring and mapping. In this work, we explore automatic data layout transformation targeting multithreaded applications running on multicores. Our transformation considers both data access patterns exhibited by different threads of a multithreaded application and the on- chip cache topology of the target multicore architecture. It automatically determines a customized memory layout for each target array to minimize potential cache conflicts across threads. Our experiments show that, our optimization brings significant benefits over state-of-the-art data locality optimization strategies when tested using 30 benchmark programs on an Intel multicore machine. The results also indicate that this strategy is able to scale to larger core counts and performs better with increased data set sizes.
Optimizing Regular Expression Matching with SR-NFA on Multi-Core Systems
Authors: Yi-Hua Edward Yang and Viktor K. Prasanna (University of Southern California)
Abstract: Conventionally, regular expression matching (REM) has been performed by sequentially comparing the regular expression (regex) to the input stream, which can be slow due to excessive backtracking (smith:acsac06). Alternatively, the regex can be converted to a deterministic finite automaton (DFA) for efficient matching, which however may require an extremely large state transition table (STT) due to exponential state explosion (meyer:swat71; yu:ancs06). We propose the segmented regex-NFA (SR-NFA) architecture, where the regex is first compiled into modular nondeterministic finite automata (NFA), then partitioned, optimized, and matched efficiently on modern multi-core processors. SR-NFA offers attack-resistant multi-gigabit per second matching throughput, does not suffer from either backtracking or state explosion, and can be rapidly constructed. For regexes causing exponential state explosion, the proposed SR-NFA is thousands of times faster to construct and update, use thousands of times less memory, and achieves ~2.5 Gbps matching throughput for thousands SR-NFA states on an 8-core 2.6 GHz Opteron platform.
PEPSC: A Power-Efficient Processor for Scientific Computing
Authors: Ganesh Dasika, Ankit Sethia, Trevor Mudge, and Scott Mahlke (University of Michigan)
Abstract: The rapid advancements in the computational capabilities of the graphics processing unit (GPU) as well as the deployment of general programming models for these devices have made the vision of a desktop supercomputer a reality. It is now possible to assemble a system that provides several TFLOPs of performance on scientific applications for the cost of a high-end laptop computer. While these devices have clearly changed the landscape of computing, there are two central problems that arise. First, GPUs are designed and optimized for graphics applications resulting in delivered performance that is far below peak for more general scientific and mathematical applications. Second, GPUs are power hungry devices that often consume 100-300 Watts, which restricts the scalability of the solution and requires expensive cooling. To combat these challenges, this paper presents the PEPSC architecture -- an architecture customized for the domain of data parallel scientific applications where power-efficiency is the central focus. PEPSC utilizes a combination of a two-dimensional single-instruction multiple-data datapath, an intelligent dynamic prefetching mechanism, and a configurable SIMD control approach to increase execution efficiency over conventional GPU solutions. A single PEPSC core has a peak performance of 120 GFLOPs while consuming 2W of power when executing modern scientific applications, which represents an increase in computation efficiency of more than 10X over existing solutions.
Performance per Watt Benefits of Dynamic Core Morphing in Asymmetric Multicores
Authors: Rance Rodrigues (U. Massachusetts at Amherst), Arunachalam Annamalai (U. Massachusetts at Amherst), Israel Koren (U. Massachusetts at Amherst), Sandip Kundu (U. Massachusetts at Amherst), and Omer Khan (U. Massachusetts at Lowell)
Abstract: The trend toward multicore processors is moving the emphasis in computation from sequential to parallel processing. However, not all applications are parallelizable and can benefit from many cores. Such applications lead to under-utilization of parallel resources, hence sub-optimal performance-per-watt. They may however, benefit from powerful uniprocessors. On the other hand, not all applications can take advantage of more powerful uniprocessors. To address competing requirements of diverse applications, we propose a heterogeneous multicore architecture with a dynamic core morphing capability. Depending on the computational demands of the currently executing applications, the resources of a few tightly coupled cores are morphed at runtime. We present a simple hardware- based algorithm to monitor the time-varying computational needs of the application and when deemed beneficial, trigger reconfiguration of the cores at fine-grain time scales to maximize the performance- per-watt of the application. The proposed dynamic scheme is then compared against a baseline static heterogeneous multicore configuration and an equivalent homogeneous configuration. Our results show that dynamic morphing of cores can provide performance-per-watt gains of 43% and 16%, when compared to the homogeneous and baseline heterogeneous configurations, respectively.
Phase-Based Application-Driven Hierarchical Power Management on the Single-chip Cloud Computer
Authors: Nikolas Ioannou (University of Edinburgh), Matthias Gries and Michael Kauschke (Intel, Braunschweig, Germany), and Marcelo Cintra (University of Edinburgh)
Abstract: To improve energy efficiency processors allow for Dynamic Voltage and Frequency Scaling (DVFS), which enables changing their performance and power consumption on-the-fly. Many-core architectures, such as the Single-chip Cloud Computer (SCC) experimental processor by Intel Labs, have DVFS infrastructures that scale by having many more independent voltage and frequency domains on-die than today's multi-cores. This paper proposes a novel, hierarchical, and transparent client-server power management scheme applicable to such architectures that tries to minimize energy consumption within a performance window taking into consideration not only the local information for cores within frequency islands but also information that spans multiple frequency islands and voltage domains. We implement our proposed hierarchical power control using a novel application driven phase detection and prediction approach for Message Passing Interface (MPI) applications, a natural choice on the SCC with its fast on-chip network and its non-coherent memory hierarchy. This phase predictor operates then as the front-end to the hierarchical DVFS controller, providing the necessary DVFS scheduling points. Experimental results with SCC hardware show that our approach provides significant improvement of the Energy Delay Product (EDP) of as much as 27.2%, and 11.4% on average, with an increase in execution time of 7.7% on average over a baseline version without DVFS. These improvements come from both improved phase prediction accuracy and more effective DVFS control of the domains, compared to existing approaches.
POPS: Coherence Protocol Optimization for Both Private and Shared Data
Authors: Hemayet Hossain, Sandhya Dwarkadas, and Michael C. Huang (University of Rochester)
Abstract: As the number of cores in a chip multiprocessor (CMP) increases, the need for larger on-chip caches also increases in order to avoid creating a bottleneck at the off-chip interconnect. Utilization of these CMPs include combinations of multithreading and multiprogramming, showing a range of sharing behavior, from frequent inter-thread communication to no communication. The goal of the CMP cache design is to maximize capacity for a given size while providing as low a latency as possible for the entire range of sharing behavior. In a typical CMP design, the last level cache (LLC) is shared across the cores and incurs a latency of access that is a function of distance on the chip. Sharing helps avoid the need for replicas at the LLC and allows access to the entire on-chip cache space by any core. However, the cost is the increased latency of communication based on where data is mapped on the chip. In this paper, we propose a cache coherence design we call POPS that provides localized data and metadata access for both shared data (in multithreaded workloads) and private data (predominant in multiprogrammed workloads). POPS achieves its goal by (1) decoupling data and metadata, allowing both to be delegated to local LLC slices for private data and between sharers for shared data, (2) freeing delegated data storage in the LLC for larger effective capacity, and (3) changing the delegation and/or coherence protocol action based on the observed sharing pattern. Our analysis on an execution-driven full system simulator using multithreaded and multiprogrammed workloads shows that POPS performs 42% (28% without microbenchmarks) better for multithreaded applications, 16% better for multiprogrammed applications, and 8% better for single thread applications compared to the base non-uniform shared L2 protocol. POPS has the added benefits of reduced on-chip and off-chip traffic and reduced dynamic energy consumption.
DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism
Authors: Byn Choi (UIUC), Rakesh Komuravelli (UIUC), Hyojin Sung (UIUC), Robert Smolinski (UIUC), Nima Honarmand (UIUC), Sarita V. Adve (UIUC), Vikram S. Adve (UIUC), Nicholas P. Carter (Intel) and Ching-Tsun Chou (Intel)
Abstract: For parallelism to become tractable for mass-market programmers, shared memory programming languages and environments must evolve to enforce disciplined practices that ban `wild shared-memory behaviors,' e.g., unstructured parallelism, arbitrary data races, and ubiquitous non-determinism. This software evolution is a rare opportunity for hardware designers to rethink hardware from the ground up to exploit opportunities exposed by such disciplined software models. Such a co-designed effort is more likely to achieve manycore scalability and large gains in power/performance than a software-oblivious hardware evolution. This paper describes SystemX, a hardware architecture motivated by these observations. We show how a disciplined parallel programming model can greatly simplify cache coherence and consistency, and enable a more efficient communication and cache architecture. Using a model checking tool for protocol verification, the SystemX coherence protocol has about 15x fewer reachable states than a state-of-the-art implementation of the widely used MESI protocol. We also show that the coherence protocol is flexible and extensible by incorporating two sophisticated optimizations (which MESI implementations usually do not incorporate because of the added complexity): flexible bulk transfers and cache-to-cache direct data transfers. Together, these seamlessly integrate message passing-like interactions within a shared memory programming model, for improved design complexity and performance efficiency.
SFMalloc: A Lock-Free and Mostly Synchronization-Free Dynamic Memory Allocator for Manycores
Authors: Sangmin Seo, Junghyun Kim, and Jaejin Lee (Seoul National University)
Abstract: As parallel programming has become the mainstream due to multicore processors, dynamic memory allocators used in C and C++ can suppress the performance of multi-threaded applications if they are not scalable. In this paper, we present a new dynamic memory allocator for multi-threaded applications. The allocator never uses any synchronization for common cases. It uses only lock-free synchronization mechanisms for uncommon cases. Each thread owns a private heap and handles memory requests on the heap. Our allocator is completely synchronization-free when a thread allocates a memory block and deallocates it by itself. Synchronization-free means that threads do not communicate with each other at all. On the other hand, if a thread allocates a block and another thread frees it, we use a lock-free stack to atomically add it to the owner thread's heap to avoid the memory blowup problem. Furthermore, our allocator exploits various memory block caching mechanisms to reduce the latency of memory management. Freed blocks or intermediate memory chunks are cached hierarchically in each thread's heap and they are used for future memory allocation. We compare the performance and scalability of our allocator to those of well-known existing multi-threaded memory allocators using eight benchmarks. Experimental results on a 48-core AMD system show that our approach achieves better performance than other allocators for all benchmarks and is highly scalable with a large number of threads.
SPATL: Honey, I Shrunk the Coherence Directory
Authors: Hongzhou Zhao (University of Rochester), Arrvindh Shriraman (Simon Fraser University), Sandhya Dwarkadas (University of Rochester), and Vijayalakshmi Srinivasan (IBM Research)
Abstract: One of the key scalability challenges of on-chip coherence in a multicore chip is the coherence directory, which provides information on sharing of cache blocks. Shadow tags that duplicate entire private cache tag arrays are widely used to minimize area overhead, but require an energy-intensive associative search to obtain the sharing information. Recent research proposed a tagless directory, which uses bloom filters to summarize the tags in a cache set. The tagless directory associates the sharing vector with the bloom filter buckets to completely eliminate the associative lookup. While directory overhead compared to shadow tags is reduced by the # of cache ways, tagless still uses a full map sharing vector to represent the sharing information, resulting in remaining area and energy challenges with increasing core counts. In this paper, we first show that due to the regular nature of applications, many bloom filters essentially replicate the same sharing pattern. We next exploit the pattern commonality and propose SPATL (SPATL is pronounced as Spatial) (Sharing-pattern based Tagless Directory). SPATL exploits the sharing pattern commonality to decouple the sharing patterns from the bloom filters and eliminates the redundant copies of sharing patterns. SPATL works with both inclusive and non-inclusive shared caches and provides 34% storage savings over tagless, the previous most storage-optimal directory, at 16 cores. We study multiple strategies to periodically eliminate the false sharing that comes from combining sharing pattern compression with tagless, and demonstrate that SPATL can achieve the same level of false sharers as tagless with ~5% extra bandwidth. Finally, we demonstrate that SPATL scales even better than an idealized directory and can support 1024- core chips with less than 1% of the private cache space for data parallel appications.
Speculative Parallelization in Decoupled Look-ahead
Authors: Alok Garg, Raj Parihar, and Michael C. Huang (University of Rochester)
Abstract: While a canonical out-of-order engine can effectively exploit implicit parallelism in sequential codes, its effectiveness is often hindered by instruction and data supply imperfections manifested as branch mispredictions and cache misses. Accurate and deep look-ahead guided by a slice of the executed program is a simple and effective approach to mitigate the performance impact of mispredictions and cache misses. Unfortunately, program slice-guided look-ahead is often limited by the speed of the look-ahead code slice, especially for irregular codes. In this paper, we attempt to speed up the look-ahead agent using speculative parallelization, which is especially suited for the task. First, slicing for look-ahead tends to reduce important data dependencies that prohibit successful speculative parallelization. Second, the task for lookahead is not correctness-critical and thus naturally tolerates dependence violations. This enables an implementation to forgo violation detection altogether, simplifying architectural support tremendously. In a straightforward implementation, incorporating speculative parallelization to the look-ahead agent further improves system performance by up to 1.39x with an average of 1.13x.
STM2: A Parallel STM for High Performance Simultaneous Multithreading Systems
Authors: Gokcen Kestor (Barcelona Supercomputing Center), Roberto Gioiosa (Barcelona Supercomputing Center), Tim Harris (Microsoft Research), Osman Unsal (Barcelona Supercomputing Center), Adrian Cristal (Barcelona Supercomputing Center), Ibrahim Hur (Barcelona Supercomputing Center), and Mateo Valero (Univesitat Politecnica de Catalunya)
Abstract: Extracting high performance from modern chip multithreading (CMT) processors is a complex task, especially for large CMT systems. Programmers must efficiently parallelize performance-critical software while avoiding deadlocks and race conditions. Transactional memory (TM) is a promising programming model that allows programmers to focus on parallelism rather than maintaining correctness and avoiding deadlock. Software-only implementations (STMs) are especially compelling because they run on commodity hardware, therefore providing high portability. Unfortunately, STM systems usually suffer from high overheads, which may limit their usage especially at scale. In this paper we present STM^2, a novel parallel STM designed for high performance, aggressive multithreading systems. STM^2 significantly lowers runtime overhead by offloading read-set validation, bookkeeping and conflict detection to auxiliary threads running on sibling hardware threads. Auxiliary threads perform STM operations in parallel with their paired application threads and absorb STM overhead, significantly improving performance. We exploit the fact that, on modern multi-core processors, sets of cores can share L1 or L2 caches. This lets us achieve closer coupling between the application thread and the auxiliary thread (when compared with a traditional multi processor). Our results, performed on an IBM POWER7 machine, a state-of-theart, aggressive multi-threaded system, show that our approach outperforms several well-known STM implementations. In particular, STM^2 shows speedups between 1.8x and 5.2x over the tested STM systems, on average, with peaks up to 12.8x.
StVEC: A Vector Instruction Extension for High Performance Stencil Computation
Authors: Naser Sedaghati, Renji Thomas, Louis-Noel Pouchet, Radu Teodorescu, and P Sadayappan (The Ohio State University)
Abstract: Stencil computations comprise the compute-intensive core of many scientific applications. The data access pattern of stencil computations often requires several adjacent data elements of arrays to be accessed in innermost parallel loops. Although such loops are vectorized by current compilers like GCC and ICC that target short-vector SIMD instruction sets, a number of redundant loads or additional intra-register data shuffle operations are required, reducing the achievable performance. Thus, even when all arrays are cache resident, the fraction of peak performance achieved with stencil computations is considerably lower than machine peak. In this paper, we present a hardware-based solution for this problem. We propose an extension to the standard addressing mode of vector floating-point instructions in ISAs such as SSE, AVX, VMX etc. Focusing specifically on the SSE ISA, we propose an extended mode of paired-register addressing and its hardware implementation, to overcome the performance limitation of current short-vector SIMD ISA's for stencil computations. Further, we present a code generation technique that can be used by a vectorizing compiler when such instructions are implemented in the processor. Using an optimistic as well as a pessimistic emulation of the proposed instruction extension, we demonstrate the effectiveness of the proposed approach on top of SSE and AVX capable processors. We also synthesize parts of the proposed design using a 45nm CMOS library to show it would have minimal impact on processor cycle time.
Using a Reconfigurable L1 Data Cache for Efficient Version Management in Hardware Transactional Memory
Authors: Adria Armejach and Azam Seyedi (Barcelona Supercomputing Center, Universitat Politecnica de Catalunya), Ruben Titos (Universidad de Murcia), Ibrahim Hur, Osman S. Unsal, and Adrian Cristal (Barcelona Supercomputing Center), and Mateo Valero (Barcelona Supercomputing Center, Universitat Politecnica de Catalunya)
Abstract: Transactional Memory (TM) potentially simplifies parallel programming by providing atomicity and isolation for executed transactions. One of the key mechanisms to provide such properties is version management, which defines where and how transactional updates (new values) are stored. Version management can be implemented either eagerly or lazily. In Hardware Transactional Memory (HTM) implementations, eager version management puts new values in-place and old values are kept in a software log, while lazy version management stores new values in hardware buffers keeping old values in-place. Current HTM implementations, for both eager and lazy version management schemes, suffer from performance penalties due to the inability to handle two versions of the same logical data efficiently. In this paper, we introduce a reconfigurable L1 data cache architecture that has two execution modes: a 64KB general purpose mode and a 32KB TM mode which is able to manage two versions of the same logical data. The later allows to handle simultaneously old and new transactional values within the cache when executing transactional workloads. We explain in detail the architectural design and internals of this reconfigurable data cache (RDC) as well as the supported operations that allow to efficiently solve existing version management problems. We describe how the RDC can support both eager and lazy HTM systems, and we present two RDC-HTM designs. Our evaluation shows that the Eager-RDC-HTM and Lazy-RDC-HTM systems achieve 1.36x and 1.18x speedup, respectively, over state- of-the-art proposals. We also evaluate the area and energy effects of our proposal, and we find that RDC designs are 1.92x and 1.38x more energy-delay efficient compared to baseline HTM systems, with less than 0.3% area impact on modern processors.