#### **Processor Architecture I**

#### Alexandre David



### Overview

- Introduction: from transistors to gates, and from gates to circuits.
- **4.1**
- **4.2**
- Micro+macro code.

# **Evolution of Computers**

#### Early systems

- CPU (*central* processing unit) controlled the entire system
  - Responsible for I/O, computations, ...

#### Modern computers

- Decentralized architecture
- Processors distributed (I/O)
- CPU still controls other processors

# **General Purpose CPU**

- Very complex because
  - designed for a wide variety of tasks (multiple roles)
  - contains special purpose sub-units
  - ex: core i7 has 731M transistors
  - supports protection and privileges (OS/applic.)
  - supports priorities (I/O)
  - data size (32/64-bit registers)
  - high speed: parallelism = replication

# **CPU Visible State**

- Visible to the ISA and used by the compiler (& assembler); there may be other registers, etc., that depend on the CPU generation.
- Registers (classify them), condition codes, status & memory.
- Memory=array of bytes (abstraction).



### ISA

- Classify:
  - load/store
  - r/i/m operands
  - arithmetics
  - jumps
  - (cmov)
  - call/return
  - stack
- Encoding: opcode + operands

| Instruction      | Byte 0 | Byte 1           | Bytes 2–5 |
|------------------|--------|------------------|-----------|
| halt             | 0 0    |                  |           |
| nop              | 1 0    |                  |           |
| rrmovl rA, rB    | 2 0    | rA rB            |           |
| irmovl V, rB     | 3 0    | F rB             | V         |
| rmmovl rA, D(rB) | 4 0    | rA rB            | D         |
| mrmovl D(rB), rA | 5 0    | rA rB            | D         |
| OPl rA, rB       | 6 fn   | rA rB            |           |
| jXX Dest         | 7 fn   | Dest (bytes 1–4) |           |
| cmovXX rA, rB    | 2 fn   | rA rB            |           |
| call Dest        | 8 0    | Dest (bytes 1–4) |           |
| ret              | 9 0    |                  |           |
| pushl rA         | A 0    | rA F             |           |
| popl rA          | B 0    | rA F             |           |

CART - Aalborg University

12-04-2011

### ISA – Notes

- No memory  $\rightarrow$  memory transfer.
- No immediate $\rightarrow$ memory transfer (x86 can do it).
- Restricted operators (add, sub, and, xor).
  - Only register register operands.
  - Typical of load/store architectures (also RISC).
- Conditional jumps depend on combinations of flags.
  - Similar for conditional move.

- Call, ret, pop, and push implicitly modify the stack (& stack pointer).
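To make "combinations of flags" concrete, here is a minimal sketch in C. It assumes the three condition codes ZF (zero), SF (sign) and OF (overflow) and the usual Y86 function-code numbering (0 = always, 1 = le, 2 = l, 3 = e, 4 = ne, 5 = ge, 6 = g); neither is stated on the slide, and the name `cond_holds` is purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if a jXX/cmovXX with function code fn should take effect,
 * given the flags left by the last arithmetic/logic operation.
 * Flag names and the fn numbering are assumptions, not taken from the slide. */
bool cond_holds(bool zf, bool sf, bool of, int fn)
{
    assert(fn >= 0 && fn <= 6);
    bool lt = (sf != of);            /* signed "less than": SF xor OF */
    switch (fn) {
    case 0:  return true;            /* unconditional */
    case 1:  return lt || zf;        /* le */
    case 2:  return lt;              /* l  */
    case 3:  return zf;              /* e  */
    case 4:  return !zf;             /* ne */
    case 5:  return !lt;             /* ge */
    case 6:  return !lt && !zf;      /* g  */
    default: return false;           /* unknown function code */
    }
}
```

The same predicate can serve both conditional jumps and conditional moves, which is why the two behave similarly.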

# Encoding



1-byte encoding = instruction code + function code.
The combination is unique for every instruction.

Register operands have a unique identifier.

- eax:0, ecx:1,... none:F
- "none" important for design
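As an illustration of the encoding, the sketch below builds the bytes for two instructions in C. Only `eax:0`, `ecx:1` and `none:F` come from the slide; the other register numbers and the little-endian byte order of the immediate are assumptions based on the usual Y86 definition, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register identifiers: eax, ecx and none are from the slide,
 * the remaining values are assumed. */
enum reg { EAX = 0, ECX = 1, EDX = 2, EBX = 3,
           ESP = 4, EBP = 5, ESI = 6, EDI = 7, RNONE = 0xF };

/* "rrmovl rA, rB": byte 0 = code 2, function 0; byte 1 = rA:rB. */
size_t encode_rrmovl(uint8_t *out, enum reg ra, enum reg rb)
{
    assert(out != NULL);
    out[0] = 0x20;
    out[1] = (uint8_t)((ra << 4) | rb);
    return 2;                                  /* instruction length in bytes */
}

/* "irmovl V, rB": byte 0 = code 3, function 0; byte 1 = F:rB;
 * bytes 2-5 = V, least significant byte first (assumed). */
size_t encode_irmovl(uint8_t *out, uint32_t v, enum reg rb)
{
    assert(out != NULL);
    out[0] = 0x30;
    out[1] = (uint8_t)((RNONE << 4) | rb);
    for (int i = 0; i < 4; i++)
        out[2 + i] = (uint8_t)(v >> (8 * i));
    return 6;
}
```

The "none" identifier fills the unused register slot, so every instruction with a register byte has the same layout.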



### Status Code

#### State of the processor

- AOK normal
- HLT halted
- ADR invalid address
- INS invalid instruction

# Y86 – X86

- Y86: simplified model of X86.
- Code is similar, except for
  - move instructions
  - restrictions $\rightarrow$ may need more instructions
    - not important, we abstract from that.
- Reason on Y86 level, exercise with both (simulator for Y86, gcc for X86).



# Y86 Assembly

- Instructions as described with registers.
- Assembly directives.
  - Where to put the code (.pos).
  - Align the code (.align).
  - Declare data (.long).
  - X86 has more.
- Label declarations (used for jump offsets).
- Assembled into bytes.
- Y86 interprets the bytes.

# Logic Design

- How to implement the hardware to recognize the instruction codes.
- Logic that
  - reads bytes,
  - interprets bytes (switch),
  - performs operations,
  - updates state.
- Transistors  $\rightarrow$  gates  $\rightarrow$  functions & blocks.
- Processor design at block level.
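The four steps above (read bytes, interpret with a switch, perform the operation, update state) can be sketched in software. The C fragment below is a toy interpreter, not the hardware design of the course: it handles only `halt`, `nop`, `rrmovl`, `irmovl` and `addl`, and the type and function names (`cpu_state`, `step`, `run`) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { AOK, HLT, ADR, INS } status;    /* names as on the Status Code slide */

typedef struct {
    uint32_t reg[8];   /* program registers */
    uint32_t pc;       /* program counter */
    status stat;
} cpu_state;

/* Execute one instruction stored in mem[0..size). */
status step(cpu_state *s, const uint8_t *mem, size_t size)
{
    assert(s != NULL && mem != NULL);
    if (s->pc >= size)
        return s->stat = ADR;                  /* fetch outside memory */
    uint8_t icode = mem[s->pc] >> 4;           /* read bytes */
    uint8_t ifun = mem[s->pc] & 0xF;
    if (ifun != 0)
        return s->stat = INS;                  /* only function 0 in this sketch */
    switch (icode) {                           /* interpret bytes */
    case 0x0:
        return s->stat = HLT;
    case 0x1:
        s->pc += 1;
        break;
    case 0x2: case 0x3: case 0x6: {
        uint32_t len = (icode == 0x3) ? 6 : 2;
        if (size - s->pc < len)
            return s->stat = ADR;
        uint8_t ra = mem[s->pc + 1] >> 4, rb = mem[s->pc + 1] & 0xF;
        if (rb > 7 || (icode == 0x3 ? ra != 0xF : ra > 7))
            return s->stat = INS;
        if (icode == 0x2) {
            s->reg[rb] = s->reg[ra];           /* rrmovl */
        } else if (icode == 0x6) {
            s->reg[rb] += s->reg[ra];          /* addl */
        } else {
            uint32_t v = 0;                    /* irmovl, little-endian value */
            for (int i = 0; i < 4; i++)
                v |= (uint32_t)mem[s->pc + 2 + i] << (8 * i);
            s->reg[rb] = v;
        }
        s->pc += len;                          /* update state */
        break;
    }
    default:
        return s->stat = INS;
    }
    return s->stat = AOK;
}

/* Run until the status is no longer AOK. */
status run(cpu_state *s, const uint8_t *mem, size_t size)
{
    s->stat = AOK;
    while (step(s, mem, size) == AOK) { }
    return s->stat;
}
```

In hardware the `switch` becomes decoding logic and the assignments become register updates on a clock edge; the software version only mirrors the sequence of steps.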

# Background

- Voltage: difference of potentials.
  - Vcc relative to ground (= 0).
  - Volts (V)
- Current: flow of electrons.
  - Amperes (A)
- Ohm's law: $U = RI$
- Dissipated power: $P = UI = U^2/R$
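A short worked example of the two formulas, with values chosen only for illustration ($U = 5\,\mathrm{V}$ across $R = 1\,\mathrm{k\Omega}$):

$$I = \frac{U}{R} = \frac{5\,\mathrm{V}}{1000\,\Omega} = 5\,\mathrm{mA}, \qquad P = UI = 5\,\mathrm{V} \times 5\,\mathrm{mA} = 25\,\mathrm{mW} = \frac{U^2}{R} = \frac{25\,\mathrm{V}^2}{1000\,\Omega}$$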

# **Typical Chips**

- Operate on low voltage (5V, less for processors) – see power dissipation.
- Always 2 lines
  - ground (0V)
  - power (5V)

Diagrams usually omit ground and power.





# **Boolean Algebra**

- Mathematical basis for digital circuits.
- From boolean functions to gates.
- Basic functions: and, or, not.
- In practice, cheaper to have nand & nor.

| A | B | A and B | A or B |
|---|---|---------|--------|
| 0 | 0 | 0       | 0      |
| 0 | 1 | 0       | 1      |
| 1 | 0 | 0       | 1      |
| 1 | 1 | 1       | 1      |

| A | not A |
|---|-------|
| 0 | 1     |
| 1 | 0     |
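Since nand is the cheap primitive in practice, it may help to see that and, or and not can all be built from it. The following C sketch models each gate as a function; the function names are illustrative, and `check_gates` simply compares the results against the truth tables for every input combination.

```c
#include <assert.h>
#include <stdbool.h>

/* The single primitive gate. */
bool nand(bool a, bool b) { return !(a && b); }

/* not, and, or expressed with nand gates only. */
bool not_gate(bool a)         { return nand(a, a); }
bool and_gate(bool a, bool b) { return not_gate(nand(a, b)); }
bool or_gate(bool a, bool b)  { return nand(not_gate(a), not_gate(b)); }

/* Exhaustively compare against the truth tables of and, or, not. */
void check_gates(void)
{
    for (int a = 0; a <= 1; a++) {
        assert(not_gate(a) == !a);
        for (int b = 0; b <= 1; b++) {
            assert(and_gate(a, b) == (a && b));
            assert(or_gate(a, b) == (a || b));
        }
    }
}
```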

### Example: Not





#### Primitive boolean functions.

Level of abstraction on integrated circuits.



#### Symbols used in circuits.

# Logic Gate Technology

- Transistor-transistor logic (TTL)
  - connect gates directly together to form boolean functions



#### and function

# **Design of Functions**

- Find a boolean expression that does what you need
  - and feed it to a tool that optimizes it to minimize the number of gates.
- Come up with the truth table of your function
  - which is converted to a boolean function.
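One standard way to convert a truth table into a boolean function is the sum-of-products form: one and-term per row whose output is 1, all terms or-ed together. The C sketch below evaluates that form directly from a table; it is an unoptimized illustration (a real tool would minimize the number of gates), and the function name and the row-indexing convention are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Evaluate the sum-of-products form of a truth table.
 * table[r] is the desired output for the row whose input bits spell r
 * (input i is bit i of r); n is the number of inputs. */
bool sum_of_products(const bool *table, int n, unsigned inputs)
{
    assert(table != NULL && n >= 1 && n <= 16);
    bool out = false;
    for (unsigned row = 0; row < (1u << n); row++) {
        if (!table[row])
            continue;                          /* only rows with output 1 give a term */
        bool term = true;                      /* and of n literals */
        for (int i = 0; i < n; i++) {
            bool x = (inputs >> i) & 1u;
            bool want = (row >> i) & 1u;
            term = term && (want ? x : !x);    /* literal x_i or not x_i */
        }
        out = out || term;                     /* or of all terms */
    }
    return out;
}
```

For the xor table `{0, 1, 1, 0}` this evaluates `(a && !b) || (!a && b)`, which is the expression one would write by hand.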

### **Truth Table**



# **Combinatorial Circuits**



- Outputs = function(inputs)
  - outputs change only when inputs change
  - need states to perform sequences of operations without sustained inputs
    - maintain states
    - use a clock

## **Practical Concerns**

- Power
  - consumption: how to feed
  - dissipation $P = CFV^2$ (C: capacitance, F: frequency): how not to burn
- Timing: gates need time to settle.
- Clock synchronization.
  - Update upon rise or fall of clock signal.
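The quadratic dependence on voltage in the dissipation formula is why processors run below 5 V. As an illustrative calculation (the voltages are example values), halving the supply voltage at the same capacitance and frequency divides the dissipated power by four:

$$\frac{P_2}{P_1} = \frac{C F V_2^2}{C F V_1^2} = \left(\frac{V_2}{V_1}\right)^2 = \left(\frac{2.5\,\mathrm{V}}{5\,\mathrm{V}}\right)^2 = \frac{1}{4}$$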

### **Clock Skew**



Signals need time to propagate. Local clocks are used on larger systems  $\rightarrow$  need to synchronize them.

The speed of light is too slow.

# Logic Design & HCL

- Design logic with gates but not one by one and not manually!
  - Use an adapted language for that. Here HCL (hardware control language) for educational purposes.
  - C-like language to express boolean formulas.
  - Combinatorial circuits built out of these formulas.
  - Acyclic network of gates: signal propagates from inputs to outputs → boolean functions.



#### Example

`bool eq = (a && b) || (!a && !b);`
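Because HCL is C-like, the same formula can be tried directly in C. This small sketch wraps it in a function (the name `bit_eq` is illustrative) and checks it against C's own comparison for all four input combinations.

```c
#include <assert.h>
#include <stdbool.h>

/* Same formula as the HCL expression: true when a and b are equal. */
bool bit_eq(bool a, bool b)
{
    return (a && b) || (!a && !b);
}

/* Exhaustive check over the four possible inputs. */
void check_bit_eq(void)
{
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            assert(bit_eq(a, b) == (a == b));
}
```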





## Multiplexor

Function: choose an input signal depending on a selection signal.

`bool out = (s && a) || (!s && b);`

Select results, functions, etc...
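The multiplexor formula can be exercised the same way in C. In this sketch (function name illustrative) the assertion documents the intended behavior: the output equals `a` when `s` is 1 and `b` otherwise.

```c
#include <assert.h>
#include <stdbool.h>

/* 1-bit multiplexor: s selects between inputs a and b. */
bool mux(bool s, bool a, bool b)
{
    bool out = (s && a) || (!s && b);
    assert(out == (s ? a : b));    /* gate formula agrees with a plain selection */
    return out;
}
```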



#### Word-Level Combinatorial Circuits

Operations defined at word level (~integers).

- Treat groups of bits together.
- Define functions at word-level.

### **Example: Equality Test**

`bool Eq = (A == B);`
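At gate level, a word-level equality test can be pictured as one bit-level equality per bit position, with all results and-ed together. The C sketch below assumes 32-bit words and spells this out; the function name is illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Word-level equality built from 32 bit-level equality tests. */
bool word_eq(uint32_t a, uint32_t b)
{
    bool eq = true;
    for (int i = 0; i < 32; i++) {
        bool ai = (a >> i) & 1u;
        bool bi = (b >> i) & 1u;
        bool eq_i = (ai && bi) || (!ai && !bi);   /* bit-level eq */
        eq = eq && eq_i;                          /* and of all bit results */
    }
    assert(eq == (a == b));                       /* agrees with C's == */
    return eq;
}
```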




### **Multiplexor At Word-Level**






### Use: Select



#### Simplified select.
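A word-level multiplexor applies the 1-bit selection formula to every bit of a word at once. A possible C sketch, assuming 32-bit words and a single select signal (the name `mux_word` is hypothetical), replicates the select bit into a mask:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Word-level multiplexor: out = a when s is 1, b when s is 0. */
uint32_t mux_word(bool s, uint32_t a, uint32_t b)
{
    uint32_t mask = s ? 0xFFFFFFFFu : 0u;       /* s copied to every bit position */
    uint32_t out = (mask & a) | (~mask & b);    /* (s && a) || (!s && b), per bit */
    assert(out == (s ? a : b));
    return out;
}
```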



# ALU

#### 2 operand inputs + 1 control input.

- Operands X and Y.
- Control selects operation.
- Same principle as select, we abstract from the exact design.
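Seen as a function, the ALU is a selection among operations driven by the control input. The C sketch below assumes the four operators listed in the ISA notes (add, sub, and, xor), control values matching the usual Y86 function codes, and `x - y` as the subtraction order; all three are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Control input values (assumed numbering). */
typedef enum { ALU_ADD = 0, ALU_SUB = 1, ALU_AND = 2, ALU_XOR = 3 } alu_op;

/* x and y are the operand inputs, op is the control input. */
uint32_t alu(uint32_t x, uint32_t y, alu_op op)
{
    assert((int)op >= 0 && (int)op <= 3);
    switch (op) {
    case ALU_ADD: return x + y;
    case ALU_SUB: return x - y;     /* wraps around modulo 2^32 */
    case ALU_AND: return x & y;
    case ALU_XOR: return x ^ y;
    }
    return 0;                       /* not reached for a valid op */
}
```

In hardware all four results would typically be computed in parallel and a multiplexor would pick one, which is the "same principle as select".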



# Memory & Clocking

- Memory stores states.
  - Functions only propagate signals.
  - Memory is implemented as flip-flop-like circuits that have feedback loops to "keep" bits.
  - Registers (hardware or program).
- Clocks synchronize when to update.
  - Between updates, signals propagate.
  - Clock signal rises  $\rightarrow$  registers are updated.

# Storing 1 Bit

**Bistable Element** 










# Storing 1 Bit (cont.)

**Bistable Element** 






# Storing and Accessing 1 Bit

#### **Bistable Element**





Resetting





Storing



#### 1-Bit Latch



Latching



Storing



### Registers





- Stores word of data
  - Different from program registers seen in assembly code
- Collection of edge-triggered latches
  - Loads input on rising edge of clock

# **Clock Synchronization**



- Register operations are synchronized.
  - Stores data bits.
  - Most of the time, acts as a barrier between input and output.
  - As clock rises, loads input.

# **Register File**

Set of program registers.

- Local and fast access storage.
- Small.
- Fixed size (machine word).

## Memory

- Abstract & simplified model.
- Simple array of bytes, no hierarchy.
  - We'll see later the hierarchy & virtual memory system.

#### Micro/Macro-code



# Complement

We'll focus on gate/logic design (= simplified model).

- Reality is more complex.
  - Design is at a higher level.
  - We'll see hints of correspondence to micro-code.

#### Microcode

How to implement a complex CPU?

- Program the complex instructions.
- Visible machine language = macro instruction set.
- Internal language = micro-code.
- Microcontroller inside the CPU that decodes and executes macro-instructions.
  - The microcontroller is RISC-like: processors are all RISCs in the end.
- Key: Easier to write programs with micro-code than to build hardware from scratch.

#### Microcode



## Data and Register Sizes

- Size of visible register may differ from size of internal registers.
  - Ex: Could implement 32-bit instruction set on a 16-bit microcontroller.

# Advantages/Drawbacks

#### Advantages

- Can change microcode and keep the same macro-instruction set!
- Less prone to errors, can be updated more easily.

#### Drawback

Cost in performance – overhead.

# Vertical Microcode

- Simple view of microcontroller ~ standard processor.
- Execution of micro-code like assembly.
- One micro-instruction at a time.
- Access to different units.
- Decode each macro-instruction and execute micro-code.
- Easy to read/write, bad performance.
- Not how it is done in practice.

### Horizontal Microcode

- Use implicit parallelism.
  - Utilize units in parallel when possible.
- Control data movements and the different hardware units at the same time.
- Very difficult to program.
- Long instruction: |exec op1 unit1|exec op2 unit2|transfer this register there|...

#### **Example Architecture**



| Unit               | Command     | Meaning                                       |
|--------------------|-------------|-----------------------------------------------|
| ALU                | 000         | No operation                                  |
|                    | 001         | Add                                           |
|                    | 010         | Subtract                                      |
|                    | 011         | Multiply                                      |
|                    | 100         | Divide                                        |
|                    | 101         | Left shift                                    |
|                    | 110         | Right shift                                   |
|                    | 111         | Continue previous operation                   |
| operand 1 or 2     | 0           | No operation                                  |
|                    | 1           | Load value from data transfer mechanism       |
| result 1 or 2      | 0           | No operation                                  |
|                    | 1           | Send value to data transfer mechanism         |
| register interface | 0 0 x x x x | No operation                                  |
|                    | 0 1 x x x x | Move register xxxx to data transfer mechanism |
|                    | 1 0 x x x x | Move data transfer mechanism to register xxxx |
|                    | 1 1 x x x x | No operation                                  |

Micro-instruction format:

| ALU   | Oper. 1 | Oper. 2 | Res. 1 | Res. 2 | Register interface |
|-------|---------|---------|--------|--------|--------------------|
| x x x | x       | x       | x      | x      | x x x x x x        |


### Horizontal Microcode

- Not like conventional programs.
- Each instruction takes one cycle
  - but not all operations take one cycle
  - special care for timing, wait for units that need more cycles

| ALU   | OP<sub>1</sub> | OP<sub>2</sub> | RES<sub>1</sub> | RES<sub>2</sub> | REG. INTERFACE |
|-------|----------------|----------------|-----------------|-----------------|----------------|
| 1 1 1 | 0              | 0              | 0               | 0               | 0 0 0 0 0 0    |

Example: ALU command 111 (continue previous operation) while all other units do nothing.
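To show how such a long instruction word is assembled, the C sketch below packs the six fields of the example architecture into 13 bits. The field widths follow the command table (ALU 3 bits, operand 1, operand 2, result 1, result 2 one bit each, register interface 6 bits); placing the ALU field in the most significant bits and the name `pack_micro` are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Pack one horizontal micro-instruction:
 * ALU (3 bits) | op1 | op2 | res1 | res2 | register interface (6 bits). */
uint16_t pack_micro(unsigned alu, unsigned op1, unsigned op2,
                    unsigned res1, unsigned res2, unsigned reg_if)
{
    assert(alu < 8 && op1 < 2 && op2 < 2 && res1 < 2 && res2 < 2 && reg_if < 64);
    return (uint16_t)((alu << 10) | (op1 << 9) | (op2 << 8) |
                      (res1 << 7) | (res2 << 6) | reg_if);
}
```

For instance, `pack_micro(7, 0, 0, 0, 0, 0)` gives the "continue" word shown above, and `pack_micro(0, 1, 0, 0, 0, 0x14)` moves register 4 onto the data transfer mechanism while operand unit 1 loads it.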

# Intelligent Microcontroller

- Schedules instructions & units.
- Handles operations in parallel.
- Performs branch prediction.
  - May try 2 paths and discard the results of the wrong one later.
  - Important: Keep the sequential semantics.
- Out-of-order execution
  - use scoreboard to keep track of results and dependencies

#### Conclusion

#### Does it matter?

- Yes! Understand your hardware and its technology.
- Use it in a better way. Reduce branches, or make them easy to guess.

Ex: the loop

`for (i = 0, j = n - 1; i < j; ++i, --j) swap(&a[i], &a[j]);`

is harder (for the hardware to guess) than

`for (i = 0; i < n / 2; ++i) swap(&a[i], &a[n - 1 - i]);`
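Both loops reverse an array in place, which can be checked with a small self-contained C version. The `swap` helper and the function names are filled in for the example; no timing claim is made here, the sketch only shows that the two forms compute the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Exchange two ints. */
void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Two induction variables; the loop test compares them with each other. */
void reverse_two_indices(int *a, int n)
{
    assert(a != NULL || n == 0);
    for (int i = 0, j = n - 1; i < j; ++i, --j)
        swap(&a[i], &a[j]);
}

/* One counter compared against a bound (n / 2) fixed before the loop starts. */
void reverse_one_index(int *a, int n)
{
    assert(a != NULL || n == 0);
    for (int i = 0; i < n / 2; ++i)
        swap(&a[i], &a[n - 1 - i]);
}
```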