

Alexandre David 1.2.05 adavid@cs.aau.dk



#### Notion of Parallelism

- Familiar concept
  - ex: building a house, car manufacturing
  - decomposition into tasks
  - task dependency
  - static decomposition
- Less familiar concept
  - dynamic decomposition
  - dynamic load balancing
  - synchronization



- Instruction-level parallelism
  - independent instructions run in parallel
- Super-scalar CPUs
  - different execution units
- Pipelines
  - cut instructions in small steps executed by different states and keep all the stages busy.

And other techniques: OO exec, prefetch...

## Pipelining and Superscalar Execution - Example

#### Compiler

c=a+b+c+d

as

c = (a+b)+(c+d)

#### CPU

- 1. load R1,@1000
- 2. load R2,@1008
- 3. addR1,@1004
- 4. addR2,@100C
- 5. add R1,R2
- 6. store R1,@2000

#### Instruction cycles



2x IF, ID, OF, ... in the same cycle: superscalar.



## The Change in Thinking

- So far, programmer got the benefits of implicit parallelism for free.
  - No paradigm changes, profit from Moore's law.
  - Sequential algorithms, sequential reasoning, sequential programming, the hardware keeps it sequential.
  - We are TOO used to that.
- Multi-cores
  - No implicit parallelism anymore.
  - We need to change habits.



- No existing bullet-proof
  - language for parallelism,
  - methodology & technique for parallelism.
- Existing programs cannot exploit multi-cores.
- Programmers do not how to write parallel programs in general.
- Algorithms are generally sequential and not fit for parallel programming directly.
- Write parallel & scalable programs to use more cores efficiently "for free".



- Special case
  - parallel by design from the start
  - graphical processing pipelined
  - so far application specific
  - try to generalize now but still, SIMD type of computations, computation intensive applications.



- Home PC, stations.
- Supercomputers history
  - but still powerful shared memory machines
     Vega 3 Series from Azul Systems up to
     864 cores, 768GB RAM, for Java applications.
- Clusters popular, cheap, scalable.
- Grid computing.



### Parallel vs. Distributed

- "Parallel": "parallel in the small".
  - shared memory, multi-cores
- "Distributed": "parallel in the big".
  - clusters, different machines
- Can have both of course.



## System Level Parallelism

- See PSS.
- Not a solution, limited to the number of tasks you have.
- Important to know:
  - bound threads scheduled by the OS
  - unbound threads scheduled by a library.



- What compilers do:
  - change ordering,
  - remove redundancy,
  - reallocate resources,
  - but keep the semantics of the sequential program.
  - They preserve the original algorithm.
- We need to design parallel algorithms, compilers are intrinsically limited.
  - Automatic algorithmic transformations are
     Imited.
     MVP'11 Aalborg University

## Example

- Sequential description.
- Need to reformulate to get it parallel.
- Still simple because additions are associative.
- However, operations on double are approximate. The ordering matters for the precision.

```
sum = 0
for(i=0; i<n; ++i)
{
   sum += x[i]
}</pre>
```







What if it's not +?

## -

#### **Prefix Sum**

- Useful primitive, aka scan.
- Input: sequence of x<sub>i</sub>.
- Output: sequence of  $y_i$ .  $y_i = sum_{j \le i} x_j$
- How to parallelize?



```
For n > 0:
y[0]=x[0]
for(i=1; i<n; ++i)
{
   y[i]=y[i-1]+x[i]
}</pre>
```

# Pr

### Prefix Sum - sum



# 4

## Prefix Sum - prefix



## **Prefix Computation 2**

```
function prefix<sup>+</sup>(A,n)
          B[1] := A[1]
       if n > 1 then
              for i = 1 to n/2 pardo
                         C[i]:=A[2i-1]+A[2i]
              od
              D:=prefix^{+}(C,n/2)
              for i = 1 to n/2 pardo
                         B[2i]:=D[i]
              od
              for i = 2 to n/2 pardo
                         B[2i-1]:=D[i-1]+A[2i-1]
              od
       prefix+:=B
end
```

## Parallel Prefix Computation 2

```
function prefix(A,n)[p_1,...,p_n]
       p<sub>1</sub>: B[1] := A[1]
       if n > 1 then
               for i = 1 to n/2 pardo
                       p_i: C[i]:=A[2i-1]+A[2i]
               od
               D:=prefix^{+}(C,n/2)[p_{1},...,p_{n/2}]
               for i = 1 to n/2 pardo
                       p_i: B[2i]:=D[i]
               od
               for i = 2 to n/2 pardo
                       p_i: B[2i-1]:=D[i-1]+A[2i-1]
               od
       prefix+:=B
end
```



## **Prefix Computation 2**





- What have we done here?
- Is it scalable?
  - What does it mean?
- Is it efficient?
- Is it optimal?
  - What does it mean?
- What about correctness?
- It is cache friendly?
- ...



## Concept of a Thread (PSS)

- Thread of execution
- Private
  - program
  - stack
  - program counter
- Shared
  - memory
  - I/O

### **Example of Execution Platform**





#### **Execution Platform of the Book**





# Limitations of Memory System Performance

- The memory system is most often the bottleneck.
- Performance captured by
  - latency and
  - bandwidth.
- Remark: In practice latency is complicated to define: CL2, CL3, 2-2-5,...



## Effect on Performance: Example

- Processor @1GHz (1ns cycle) capable of executing 4 IPC + DRAM with 100ns latency.
- 4 IPC @1GHz -> 4GFLOPS peak rating.
- Processor must wait 100 cycles for every request.
  - Vector operations (dot product) @10MFLOPs.



### Improving with Cache

- Note: Often "\$\$" on pictures (cash).
- Hierarchical memory architecture with several levels of cache (2 common).
- Instruction and data separate for L1.
- Low latency, high bandwidth, but small.
- ? Why does it improve performance???



- Temporal locality
  - Repeated access to the same data in a small window of time.
- Spatial locality
  - Consecutive data accessed by successive instructions.
- Vital assumptions, almost always hold.
- Very important for parallel computing.

## 1

## Case Study: Count 3s

```
int count3s(int *array, int length)
{
     int count = 0;
     for(i = 0; i < length; ++i)
     {
         if (array[i] == 3) count++;
     }
     return count;
}</pre>
```

Serial code → parallel code?



- Partition the input
  - static data partitioning
- Shared variable counter





Thread creation ———

Partitioning \_\_\_\_\_

Count -----

```
/* number of threads */
     int t;
     int *array;
     int length;
     int count;
     void count3s()
        int i;
       count = 0;
       /* Create t threads */
        for(i=0; i<t; i++)
 11
 12
 13
          thread create(count3s thread, i);
 14
                       (wait for the threads missing)
 15
 16
        return count;
 17
 18
     void count3s thread(int id)
 20
        /* Compute portion of the array that this thread
 21
           should work on */
 22
           int length per thread=length/t;
           int start=id*length per thread;
 23
 24
 25
        for(i=start; i<start+length per thread; i++)</pre>
 26
 27
         if(array[i]==3)
 28
                             Not atomic \rightarrow race
 29
           count++;
 30
 31
N 32 }
```



read count

count: 0

inc local value

read count

write count

count: 1

inc local value

count: 1

write count

Expected: 2

Race: The result depends on the interleaving of the threads. It is unpredictable.



## Try 2: Make It Atomic

- Use a mutex
  - locked
  - unlocked
- Mutexes are used to define critical sections

## Try 2

```
mutex m;
    void count3s thread(int id)
 4
      /* Compute portion of the array that this thread
 5
         should work on */
      int length per thread=length/t;
      int start=id*length per thread;
      for(i=start; i<start+length_per_thread; i++)</pre>
10
11
        if(array[i]==3)
12
          mutex_lock(m);
13
14
          count++;
15
          mutex unlock(m);
16
17
18
```



### Correct But Abysmal Performance



## Try 3

```
private count[MaxThreads];
   mutex m;
3
   void count3s thread(int id)
5
      /* Compute portion of array for this thread to
         work on */
      int length per thread=length/t;
 8
      int start=id*length_per_thread;
 9
10
      for(i=start; i<start+length per thread; i++)</pre>
11
12
        if(array[i] == 3)
13
14
          private count[id]++;
15
16
17
      mutex lock(m);
18
      count+=private_count[id];
      mutex unlock(m);
19
20
```



### Better But Still Not Good







#### Core i7



## False Sharing

- Caches have some granularity = cache line.
- Usually on several words, 2-4 words.
- Consecutive counters are not logically shared but they are physically shared on the same cache line.
- The cache coherence protocol kicks in and kills performance because the line is constantly moving.



## Cache Coherence Protocols

- We need additional hardware to keep multiple copies of the same memory bank consistent with each other.
- We have seen that \$\$ is good but it does not come for free.
- Mechanism known as cache coherence protocol, usually described as state machines.



**Figure 2.21** Cache coherence in multiprocessor systems: (a) Invalidate protocol; (b) Update protocol for shared variables.



### Try 4: Pad the Counters

```
struct padded int
 2
      int value;
      char padding[60];
 4
    } private count[MaxThreads];
 6
    void count3s thread(int id)
 8
 9
      /* Compute portion of the array this thread should
         work on */
      int length per thread=length/t;
10
11
      int start=id*length per thread;
12
13
      for(i=start; i<start+length per thread; i++)</pre>
14
15
        if(array[i] == 3)
16
17
          private count[id]++;
18
19
20
      mutex lock(m);
21
      count+=private count[id].value;
22
      mutex unlock(m);
23
```



### Correct and Good



Limitation of the hardware



## Confirmation of the Memory Bandwidth Limit



no 3 in the array



- Correctness
- Performance
- Scalability
- Portability