# Parallel Programming Platforms Alexandre David B2-206 http://www.cs.aau.dk/~adavid/teaching/MTP-06/ - Implicit Parallelism (2.1) - Limitations of Memory System Performance (2.2) - Dichotomy of Parallel Computing Platforms (2.3) Alexandre David, MVP'06 2 I will follow the book, skipping sections from time to time but the order will be respected. References to sections are given on the web and reminded during the lectures (plan may change). Lectures are intended to teach you and help you read the book. - Bottlenecks in computers: - Processor - Memory - Datapath - Addressed with multiplicity. - Parallelization not solution to everything - Sub-optimal serial code bad - Optimize serial first (similar characteristics) Alexandre David, MVP'06 3 #### Bottlenecks of different kinds. - •Processor: less and less, though can be depending on some bad behaviors (branch miss prediction in large pipelines). - •Memory: more and more considering the speed gap between processor and memory, the problem being how to feed the processor with data so that it does not stay idle. - •Datapath: depends on programs and architecture, linked to previous ones. Motivations for optimizing serial programs and why we talk about implicit parallelism: Sub-optimal serial code exhibits **unreliable** and **misleading** behaviors. Undesirable effects coming from poor cache utilization, bad branch prediction, etc ... that may become even worse in a parallel context (distribute data, synchronize, etc ...). Similar characteristics in serial programs with intrinsic parallelism of modern processors (pipelines). Understanding architecture is the first step to good programming. #### Trends in Microprocessors - Processor speed increase exponentially - More and more transistors: How to use them wisely? - Multiple functional units run multiple instructions in the same clock cycle: superscalar processors. - How to select and execute instructions? 7-02-2006 Alexandre David, MVP'06 4 Already mentioned: Moore's law power x2 every 18 month, but for processor only. Global performance plagued by memory abysmal performance. Higher level of integration poses the question of how to best utilize transistors. Functional units: MMU, FPU, etc usually part of the marketing buzz of microprocessor companies. How to? That's the different architectures, not going into details but basically all processors are +/- RISC processors with a translation from assembler to microcode. Then most have branch prediction. #### Pipelining and Superscalar Execution - Pipeline idea: overlap stages in instruction execution. - Example of car factory. - The good: higher throughput. - The bad: penalty of branch miss prediction. - Multiple pipelines: several functional units. 7-02-2006 Alexandre David, MVP'06 5 Overlapping stages: cut instructions in small pieces, one per cycle, and try to occupy all the stages. But why after all? Not better to have a super powerful one stage-do-all? Car factory: Imagine a fast factory where it takes 12h to complete one car. If there is one unit doing all, then it will be busy all the time and throughput would be 1 car per 12h. How to improve? Buy 11 other full scale units? Super expensive! Cut the bit unit in 12 smaller parts, 1h per part, every car needs 12h but throughput is 1 car per hour. 12x faster, cost efficient. Branch prediction: try to keep the pipeline busy by filling it ahead, but if did it wrong, then need to flush it (and loose all the computations). P4: 20 stages, miss prediction means loose 20 cycles. Dual issue or two-way superscalar execution. IF: Instruction Fetch. ID: Instruction Decode. OF: Operand Fetch. E: Instruction Execute. WB: Write back. NA: No Action. Note: begin to execute 6<sup>th</sup> instruction at 4<sup>th</sup> clock cycle. ## Pipelining and Superscalar Execution - 4 - Imagine another ordering (or factorization by the compiler): different performance. - Resolve data dependency. - Reordering by CPU possible (out-oforder execution). - Resource dependency. 7-02-2006 Alexandre David, MVP'06 7 Data dependency: needs previous results in order to continue computations. A=(B+C)\*D, we need B+C before computing \*D. Resource dependency: needs functional units. A=B\*C+C\*D+D\*E+E\*F+F\*G, obviously not all \* can be done in parallel because of lack of functional units. Most processors are capable of out-of-order execution, not Xbox 360. Intrinsic parallelism: Pipeline (multiple stages) & multiple functional units (superscalar) implement parallelism. Modern processors: 4 way superscalar, 10-20 stage pipeline. More explanation for fig 2.1. - Bundle instructions together to simplify the superscalar scheduler. - IA64 (Itanium) is an example. - Problems: - Rely a lot on the compiler. - Limited parallelism (not dynamic). Alexandre David, MVP'06 10 #### Very Long Instruction Word Processors! Superscalar schedulers are complex and expensive (transistors). VLIW design idea is to rely on the compiler to bundle instructions together, so that the scheduler becomes very simple. - The memory system is most often the bottleneck. - Performance captured by - latency and - bandwidth. - Remark: In practice latency is complicated to define: CL2, CL3, 2-2-2-5,... Alexandre David, MVP'06 11 The problem is most often how to feed the processor with continuous data so that it does not **stall**. Latency is the time from the issue of a memory request to the time the data becomes available to the processor. Bandwidth is the rate at which data can be pumped to the processor. Example: water hose. Latency: time before first drop of water comes out. Bandwidth: rate flow of water. - Processor @1GHz (1ns cycle), DRAM with 100ns latency, capable of executing 4 IPC. - 4 IPC @1GHz -> 4GFLOPS peak rating. - Processor must wait 100 cycles for every request. - Vector operations (dot product) @10MFLOPs. Alexandre David, MVP'06 12 No cache in this example to simplify. It is still general enough since we can consider first access to some memory and take cache miss into account. ALU: arithmetic and logical unit. FPU: floating point unit. Here absolute worst case scenario but still we loose a factor 100 in performance. #### Improving with Cache - Note: Often "\$\$" on pictures (cash). - Hierarchical memory architecture with several levels of cache (2 common). - Instruction and data separate for L1. - Low latency, high bandwidth, but small. - Why does it improve??? 7-02-2006 Alexandre David, MVP'06 13 Common: Athlon 64 64K+64K L1, 1M L2. Pentium 4 more complicated NetBurst with execution trace cache (12K) and 16K L1, with 1M L2. Now you have to think some time about why it helps. You know about cache hit ratio, cache miss, at least you've heard about it. ### Why is \$\$ good? - Temporal locality - Repeated access to the **same** data in a small window of time. - Spatial locality - Consecutive data accessed by successive instructions. - Vital assumptions, almost always hold. - Very important for parallel computing. 7-02-2006 Alexandre David, MVP'06 14 REMEMBER these two! They are common to almost all programs and are vital to cache performance. For parallel computing, even more important: apart from the aggregate higher amount of cache that must be used wisely, we have more penalty for moving data around processors (or processor nodes). That also explains the model numbering for AMD processors. © ### Matrix Multiplication Example - Common example, will be used many times in the course. - C=A\*B, where A (n\*m), B (p\*n), and C (p\*m) are matrices. $$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$$ 7-02-2006 Alexandre David, MVP'06 15 Compatible dimensions required. In practice n\*n. Re-use: spatial and temporal localities. Intuitively: $n^3$ accesses on 2x $n^2$ matrices (if $n^*n$ ). #### Cache Characteristics - Hit ratio (behavior): fraction of references satisfied by the cache. - Cache line (= bus width): granularity. - Associativity (architecture): "collision list" to reduce cache eviction. - For the matrix: 2n<sup>2</sup> fetches from memory to populate the cache, and then n<sup>3</sup> direct accesses at full speed. 7-02-2006 Alexandre David, MVP'06 18 Data re-use is the keyword. Cache line: word granularity is too expensive and bad for spatial locality. 4 words usually for L2 (access to system bus), and internally 256-bit data bus for L1<->L2 (8 words). The term *cache eviction* is not mentioned in the book and is missing. It is an important notion to know. - Access to successive words much better than random access. - Higher bandwidth (whole cache line at once) - Better latency (successive words already in cache) Alexandre David, MVP'06 19 - a) Add vectors (temporary results) to get final result. - b) Compute final result incrementally.Strided access (a) yields poor performance. # Other Approaches to Hide Latency - Prefetching - but may evict useful data because cache is small. - Multi-threading - but needs higher bandwidth because all the threads share the same bus. 7-02-2006 Alexandre David, MVP'06 21 Prefetching is like fetching the next cache line when possible (hardware decides), or same effect by reordering instructions (hardware or compiler) to issue loads long before usage. Works if consecutive words are accessed by consecutive instructions (spatial locality). Multi-thread: switch to another thread when a thread stalls for data and keep the processor busy. In fact, both solutions address the latency problem and exacerbate the bandwidth problem. That was probably the design idea behind RAMBUS, though no multi-threading at the time to use it! + the fact that latency was way higher than other systems. Software: need to create the threads explicitly. Hardware: need support for multi-thread. ### Summary on Memory - Exploit spatial and temporal locality in programs. For sequential and parallel programs! - Operations/memory accesses is a good indicator of tolerance to memory bandwidth. - Processing is cheap, memory is expensive. 7-02-2006 Alexandre David, MVP'06 23 - Logical organization: programmer's view. - Physical organization: actual hardware. - Two critical components: - expressing parallel tasks (control structure) - specifying interaction between them (communication model). Alexandre David, MVP'06 24 The 2 critical components both through logical and physical organizations. #### **Control Structure** - Parallelism can be expressed at different levels of granularity - from instruction level parallelism - to processes. - SIMD: single instruction stream, multiple data stream. - MIMD: multiple instruction stream ... 7-02-2006 Alexandre David, MVP'06 25 Processing units in parallel computers either operate under the centralized control of a single control unit or work independently. - •SIMD: A single control unit dispatches the same instruction to various processors or functional units. - •MIMD: Each processor has its own control unit and can execute different instructions on different data items. PE: processing element. Typically, SIMD implemented as special instruction sets on processors (MMX, SSE, SSE2, 3DNow!), and MIMD implemented as multiprocessor machines or clusters. SIMD relies on the regular structure of computations (such as those in image processing). A variant of MIMD, called single program multiple data streams (SPMD) executes the same program on different processors, which is often the case in practice. - Memory shared between several processors. - NUMA different access time - UMA same access time. - Cases with local cache considered UMA. - Easier programming, one address space - but cache coherence mechanisms needed, - But need to solve contention (writes). Alexandre David, MVP'06 27 NUMA: non-uniform memory access. Cheaper and easier to implement. Need locality to perform well. UMA: uniform memory access. Performant uniform access expensive. ## Message-Passing Platforms - Memory private to processors. - Interaction via messages - Send/receive primitives. - MPI libraries. - Hardware needed: good network interconnect. 7-02-2006 Alexandre David, MVP'06 30 Cheap and popular solution: cluster of MP machines connected via high bandwidth network.