Computer Architecture & Organization – Notes

1: Introduction to Computer Architecture

Basic Organization of a Computer System 🏗️

A computer system consists of four main components:

  • CPU (Central Processing Unit): The "brain" that executes instructions
  • Memory: Stores data and programs (RAM, ROM, Cache)
  • Input/Output Devices: Allow communication with the external world
  • System Bus: Connects all components for data transfer

Example: When you press a key (input), the CPU processes this information, retrieves data from memory, and displays the result on your screen (output).

Evolution of Computers 📈

  1. First Generation (1940s-1950s): Vacuum tubes, bulky, high power consumption
  2. Second Generation (1950s-1960s): Transistors, more reliable, less power
  3. Third Generation (1960s-1970s): Integrated circuits
  4. Fourth Generation (1970s-present): Microprocessors, VLSI technology
  5. Fifth Generation (Present and future): AI, quantum computing, neural networks

Example: The ENIAC (1946) weighed 30 tons and used vacuum tubes, while today's smartphones are millions of times more powerful at a fraction of the size.

Von Neumann Architecture 🧠

Key principles:

  • Stored program concept (instructions and data in same memory)
  • Sequential instruction execution
  • Single data path between CPU and memory

Components:

  • CPU (with ALU and Control Unit)
  • Memory unit
  • Input/Output system
  • System bus

Example: Most modern computers follow this architecture, where programs and data are loaded from storage into RAM before execution.

Harvard vs. Von Neumann Architecture 🔄

Von Neumann:

  • Single memory for both instructions and data
  • One bus for memory access
  • Simpler design

Harvard:

  • Separate memories for instructions and data
  • Separate buses for each memory
  • Allows simultaneous access to instructions and data

Example: Most desktop computers use Von Neumann architecture, while many microcontrollers and DSPs (such as the AVR chips on Arduino boards) use Harvard architecture for better performance.

Performance Metrics 📊

  • Latency: Time to complete a single task
  • Throughput: Number of tasks completed per unit time
  • CPI (Cycles Per Instruction): Average clock cycles needed per instruction
  • MIPS (Million Instructions Per Second): Measure of CPU performance

Example: A CPU with a CPI of 2.0 and a clock speed of 4 GHz would have: MIPS = clock rate / (CPI × 10^6) = (4 × 10^9) / (2.0 × 10^6) = 2000 MIPS
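
The same calculation as a minimal C sketch, using the values from the example above:

#include <stdio.h>

int main(void) {
    double clock_hz = 4e9;   /* 4 GHz clock */
    double cpi      = 2.0;   /* average cycles per instruction */

    /* MIPS = clock rate / (CPI * 10^6) */
    double mips = clock_hz / (cpi * 1e6);
    printf("MIPS = %.0f\n", mips);   /* prints 2000 */
    return 0;
}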


2: Digital Logic and Data Representation

Number Systems 🔢

  • Binary (Base 2): Uses only 0 and 1 (e.g., 1010₂ = 10₁₀)
  • Octal (Base 8): Uses digits 0-7 (e.g., 12₈ = 10₁₀)
  • Hexadecimal (Base 16): Uses digits 0-9 and letters A-F (e.g., A₁₆ = 10₁₀)

Conversion Examples:

  • Decimal 25 to binary: 25₁₀ = 11001₂
  • Binary 1101 to decimal: 1101₂ = 1×2³ + 1×2² + 0×2¹ + 1×2⁰ = 8 + 4 + 0 + 1 = 13₁₀
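
In C, octal and hexadecimal conversions come for free with printf, and strtol parses binary strings — a quick sketch using the numbers above:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 25;

    /* printf has octal (%o) and hex (%X) conversions built in */
    printf("%d decimal = %o octal = %X hex\n", n, n, n);   /* 25 = 31 = 19 */

    /* strtol with base 2 parses a binary string back to decimal */
    long v = strtol("1101", NULL, 2);
    printf("1101 (base 2) = %ld (base 10)\n", v);   /* prints 13 */
    return 0;
}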

Boolean Algebra and Logic Gates ⚙️

Basic Operations:

  • AND (·): Output is 1 only if all inputs are 1
  • OR (+): Output is 1 if at least one input is 1
  • NOT (¬): Inverts the input

Basic Logic Gates:

  • AND, OR, NOT, NAND, NOR, XOR, XNOR

Example: For the Boolean expression Y = A·B + C: if A=1, B=0, C=1, then Y = 1·0 + 1 = 0 + 1 = 1.
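
The same expression can be checked with C's bitwise operators (for single-bit values, & is AND and | is OR):

#include <stdio.h>

int main(void) {
    int A = 1, B = 0, C = 1;

    int Y = (A & B) | C;       /* Y = A·B + C */
    printf("Y = %d\n", Y);     /* (1 AND 0) OR 1 = 1 */
    return 0;
}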

Combinational Circuits 🔌

  • Multiplexer (MUX): Selects one of many input signals and forwards it to a single output
  • Demultiplexer (DEMUX): Takes a single input and directs it to one of many outputs
  • Encoder: Converts 2ⁿ input lines to n output lines
  • Decoder: Converts n input lines to 2ⁿ output lines

Example: A 4-to-1 multiplexer has 4 data inputs, 2 selection inputs, and 1 output. The selection inputs determine which data input is connected to the output.
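
A small C sketch of a 4-to-1 multiplexer — the two selection bits pick which data input reaches the output:

#include <stdio.h>

/* 'sel' (0-3) selects one of the four data inputs */
int mux4to1(int d0, int d1, int d2, int d3, int sel) {
    int inputs[4] = { d0, d1, d2, d3 };
    return inputs[sel & 0x3];   /* keep only the two selection bits */
}

int main(void) {
    printf("%d\n", mux4to1(7, 8, 9, 10, 2));   /* sel=2 picks d2: prints 9 */
    return 0;
}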

Sequential Circuits ⏱️

  • Flip-Flops: Basic memory element that stores one bit of data
    • Types: SR, JK, D, T
  • Registers: Group of flip-flops used to store multiple bits
  • Counters: Circuits that count pulses, sequentially going through states

Example: A 4-bit register using D flip-flops can store values from 0000 to 1111, representing numbers 0-15.
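
A minimal C model of such a register, assuming the whole value updates on a simulated clock edge the way a bank of D flip-flops does:

#include <stdio.h>

/* a 4-bit register: four D flip-flops sharing one clock */
typedef struct { unsigned q : 4; } Reg4;

/* on the rising clock edge, the stored value takes the input value */
void clock_edge(Reg4 *r, unsigned d) {
    r->q = d & 0xF;   /* only 4 bits (values 0-15) are kept */
}

int main(void) {
    Reg4 r = { 0 };
    clock_edge(&r, 13);              /* latch 1101 on the clock edge */
    printf("%u\n", (unsigned)r.q);   /* prints 13 */
    return 0;
}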

Data Representation 📊

Signed Numbers:

  • Sign-Magnitude: First bit indicates sign, rest represent magnitude
  • 1's Complement: Invert all bits to negate
  • 2's Complement: Invert all bits and add 1 to negate (most common)

Example: -5 in 8-bit 2's complement:

  1. +5 = 00000101
  2. Invert: 11111010
  3. Add 1: 11111011 (this is -5)
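
The same three steps in C — the final cast assumes the usual two's complement interpretation of the bit pattern:

#include <stdio.h>

int main(void) {
    unsigned char plus5 = 0x05;                          /* step 1: +5 = 00000101 */
    unsigned char neg5  = (unsigned char)(~plus5 + 1);   /* steps 2-3: invert, add 1 */

    printf("bits:  0x%02X\n", neg5);            /* 0xFB = 11111011 */
    printf("value: %d\n", (signed char)neg5);   /* prints -5 */
    return 0;
}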

Floating-Point Representation (IEEE 754):

  • Sign bit (1 bit)
  • Exponent (8 bits for single precision, 11 for double)
  • Mantissa/Fraction (23 bits for single precision, 52 for double)

Example: The decimal number 10.5 in IEEE 754 single precision:

  1. Convert to binary: 1010.1
  2. Normalize: 1.0101 × 2³
  3. Sign bit = 0 (positive)
  4. Exponent = 3 + 127 (bias) = 130 = 10000010
  5. Mantissa = 0101 (rest are zeros)
  6. Result: 0 10000010 01010000000000000000000
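
The resulting bit pattern can be verified in C by copying the float's bytes into an integer (memcpy is the portable way to reinterpret them):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    float f = 10.5f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret the 4 bytes */

    printf("sign     = %u\n",      bits >> 31);           /* 0 */
    printf("exponent = %u\n",     (bits >> 23) & 0xFF);   /* 130 = 10000010 */
    printf("mantissa = 0x%06X\n",  bits & 0x7FFFFF);      /* 0x280000 = 0101 then zeros */
    return 0;
}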

Practice Questions ✍️

  1. What are the key differences between Harvard and Von Neumann architectures?
  2. Convert the decimal number 43 to binary, octal, and hexadecimal.
  3. Implement a full adder circuit using basic logic gates.
  4. How is the number -25 represented in 8-bit 2's complement?
  5. Calculate the CPI of a processor that executes 30% arithmetic, 40% data transfer, and 30% control instructions with CPIs of 1, 2, and 3 respectively.

3: Processor Organization and Instruction Set Architecture

CPU Organization 🧩

Main Components:

  • ALU (Arithmetic Logic Unit): Performs arithmetic and logical operations
  • Control Unit: Fetches, decodes, and executes instructions
  • Registers: Small, high-speed storage locations within the CPU
  • Buses: Communication pathways between components
    • Data Bus: Transfers actual data
    • Address Bus: Transfers memory addresses
    • Control Bus: Carries control signals

Common CPU Registers:

  • Program Counter (PC): Points to the next instruction
  • Instruction Register (IR): Holds the current instruction
  • Memory Address Register (MAR): Holds memory addresses
  • Memory Data Register (MDR): Holds data being transferred to/from memory
  • Accumulator (ACC): Stores results of operations
  • General Purpose Registers: Used for various operations

Example: During instruction execution, the PC points to the instruction in memory, the Control Unit fetches it into the IR, decodes it, and directs the ALU to perform the required operation using data from registers.
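
A toy fetch-decode-execute loop in C makes these register roles concrete; the two-instruction "program" and opcode values here are invented for illustration:

#include <stdio.h>

enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2 };   /* hypothetical opcodes */

int main(void) {
    /* program in memory: LOAD 7, ADD 5, HALT */
    int memory[] = { OP_LOAD, 7, OP_ADD, 5, OP_HALT, 0 };
    int pc = 0, acc = 0;

    for (;;) {
        int ir      = memory[pc++];   /* fetch opcode into the IR, advance PC */
        int operand = memory[pc++];   /* fetch the operand */
        if      (ir == OP_LOAD) acc = operand;    /* decode and execute */
        else if (ir == OP_ADD)  acc += operand;
        else break;                               /* HALT */
    }
    printf("ACC = %d\n", acc);   /* prints 12 */
    return 0;
}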

Instruction Format and Types 📝

Common Instruction Formats:

  • Single-address: Operation code + one address
  • Two-address: Operation code + two addresses
  • Three-address: Operation code + three addresses
  • Zero-address: Just the operation code (stack-based)

Instruction Types:

  • Data Transfer: Move data between locations (e.g., MOV, LOAD, STORE)
  • Arithmetic/Logical: Perform calculations (e.g., ADD, SUB, AND, OR)
  • Control Flow: Change program execution (e.g., JMP, CALL, RET)
  • I/O Instructions: Interact with external devices

Example:

ADD R1, R2, R3  # Three-address format: R1 = R2 + R3
ADD R1, R2      # Two-address format: R1 = R1 + R2

Addressing Modes 🎯

Ways to specify the location of an operand:

  • Immediate: Operand is in the instruction (e.g., ADD R1, #5)
  • Register: Operand is in a register (e.g., ADD R1, R2)
  • Direct: Instruction contains the memory address (e.g., ADD R1, [100])
  • Indirect: Instruction contains address where the actual address is stored (e.g., ADD R1, [R2])
  • Indexed: Address = Base address + Index (e.g., ADD R1, [100+R2])
  • Base-relative: Address = Base register + Offset (e.g., ADD R1, [R2+10])

Example: If R2 contains value 500, in indexed addressing mode, the instruction LOAD R1, [100+R2] would load data from memory address 600 into register R1.
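
The same effective-address arithmetic, sketched in C with an array standing in for memory:

#include <stdio.h>

int main(void) {
    int mem[1024] = { 0 };   /* pretend this is main memory */
    int R2 = 500;            /* index register */

    mem[600] = 42;           /* place a value at address 600 */

    /* LOAD R1, [100+R2]: effective address = 100 + 500 = 600 */
    int R1 = mem[100 + R2];
    printf("R1 = %d\n", R1); /* prints 42 */
    return 0;
}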

RISC vs. CISC Architectures 🔄

RISC (Reduced Instruction Set Computer):

  • Simple instructions of uniform length
  • Fixed-format instructions
  • More registers, fewer memory accesses
  • Hardwired control unit
  • Emphasis on optimization by compiler
  • Examples: ARM, MIPS, RISC-V

CISC (Complex Instruction Set Computer):

  • Complex instructions of variable length
  • Many addressing modes
  • Fewer registers, more memory accesses
  • Microprogrammed control unit
  • Hardware handles complexity
  • Examples: x86, x86-64

Example: A CISC processor might have a single instruction MULT MEM1, MEM2 that multiplies values from two memory locations, while a RISC processor would require multiple simpler instructions to load values into registers, multiply them, and store the result.

Instruction Pipelining and Hazards ⚠️

Pipelining: Overlapping execution of multiple instructions, typically in stages:

  1. Fetch (IF): Get instruction from memory
  2. Decode (ID): Interpret the instruction
  3. Execute (EX): Perform the operation
  4. Memory Access (MEM): Read/write memory if needed
  5. Write Back (WB): Store results in registers

Pipeline Hazards:

  • Data Hazards: Instruction depends on result of previous instruction not yet completed
    • RAW (Read After Write): Most common, waiting for a result to be produced
    • WAR (Write After Read): Register being written to before previous read completes
    • WAW (Write After Write): Register written to twice in sequence
  • Structural Hazards: Hardware resource conflict (e.g., two stages need the same resource)
  • Control Hazards: Pipeline disruption due to branch instructions

Solutions:

  • Forwarding/Bypassing: Pass results directly between pipeline stages
  • Pipeline Stalls/Bubbles: Delay instruction execution
  • Branch Prediction: Guess outcome of branch instructions

Example:

ADD R1, R2, R3  # R1 = R2 + R3
SUB R4, R1, R5  # R4 = R1 - R5

The SUB instruction has a RAW hazard since it needs the result of ADD in R1. Forwarding can pass the result directly to the SUB instruction without waiting for the full write-back stage.

4: Memory Organization

Memory Hierarchy 📊

Arranged from fastest/smallest/most expensive to slowest/largest/least expensive:

  • Registers: Inside CPU, fastest access (< 1 ns)
  • Cache: Small, fast SRAM (L1: 1-2 ns, L2: 3-10 ns, L3: 10-20 ns)
  • Main Memory (RAM): Moderate speed and size (50-100 ns)
  • Secondary Storage: Hard drives, SSDs (tens of μs for SSDs, ms for hard drives)
  • Tertiary Storage: Tapes, optical media (seconds)

Principle: Keep frequently accessed data in faster memory.

Example: When a program runs, frequently used variables are kept in registers, commonly accessed code segments in cache, and the full program in main memory, with less frequently used data in secondary storage.

Cache Memory 🚀

Purpose: Bridge the speed gap between CPU and main memory.

Cache Mapping Techniques:

  • Direct Mapping: Each main memory block maps to exactly one cache location
    • Simple but prone to conflicts
  • Fully Associative: Any memory block can go in any cache location
    • Flexible but complex search
  • Set Associative: Compromise between direct and fully associative
    • Memory blocks map to specific sets, but can go anywhere within that set

Cache Replacement Policies (when cache is full):

  • Least Recently Used (LRU): Remove the least recently accessed block
  • First-In, First-Out (FIFO): Remove the oldest block
  • Least Frequently Used (LFU): Remove the least frequently accessed block
  • Random: Remove a random block

Cache Performance Metrics:

  • Hit Ratio: Percentage of memory accesses found in cache
  • Miss Ratio: Percentage of memory accesses not found in cache

Example: In a 2-way set associative cache with 4 sets (8 blocks total), each memory block maps to the set selected by the low-order bits of its block address, and can occupy either of the two blocks (ways) within that set.
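
A small C sketch of this index calculation, assuming 16-byte blocks and the 4-set cache above (with these parameters, address 0x12345678 lands in set 3):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t addr       = 0x12345678;
    uint32_t block_size = 16;    /* bytes per block */
    uint32_t num_sets   = 4;     /* 2-way, 8 blocks total */

    uint32_t block_addr = addr / block_size;    /* strip the offset bits */
    uint32_t set_index  = block_addr % num_sets;
    uint32_t tag        = block_addr / num_sets;

    printf("set %u, tag 0x%X\n", set_index, tag);   /* set 3, tag 0x48D159 */
    return 0;
}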

Virtual Memory 🧠

Purpose: Allow programs to use more memory than physically available by using disk space as an extension of RAM.

Key Concepts:

  • Pages: Fixed-size blocks of virtual memory
  • Frames: Corresponding blocks in physical memory
  • Page Table: Maps virtual pages to physical frames
  • Page Fault: Occurs when accessed page is not in physical memory

Virtual Memory Techniques:

  • Paging: Memory divided into fixed-size pages
    • Simple implementation
    • May have internal fragmentation
  • Segmentation: Memory divided based on logical segments (code, data, stack)
    • More natural program organization
    • May have external fragmentation
  • Paged Segmentation: Combines both approaches

Example: A program using 8 GB of virtual memory might only have 2 GB of its pages in physical RAM at any time, with the rest stored on disk and swapped in as needed.
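
A sketch of the page-number/offset split in C, assuming 4 KB pages and an invented page-table entry that maps the page to frame 42:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t vaddr     = 0x00403ABC;
    uint32_t page_size = 4096;                /* assume 4 KB pages */

    uint32_t page_num  = vaddr / page_size;   /* index into the page table */
    uint32_t offset    = vaddr % page_size;   /* unchanged by translation */

    uint32_t frame = 42;                      /* hypothetical page table lookup */
    uint32_t paddr = frame * page_size + offset;

    printf("page %u, offset 0x%X -> physical 0x%X\n", page_num, offset, paddr);
    return 0;
}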

RAM & ROM Types 💾

RAM (Random Access Memory) - Volatile:

  • SRAM (Static RAM):
    • Faster, uses flip-flops
    • More expensive, larger size
    • Used for cache memory
  • DRAM (Dynamic RAM):
    • Slower, uses capacitors that need refreshing
    • Less expensive, smaller size
    • Used for main memory

ROM (Read-Only Memory) - Non-volatile:

  • Mask ROM: Programmed during manufacturing
  • PROM (Programmable ROM): Can be programmed once
  • EPROM (Erasable PROM): Erasable with UV light
  • EEPROM (Electrically Erasable PROM): Can be erased and reprogrammed electrically
  • Flash Memory: EEPROM variant, allows block erasure
    • Used in SSDs, USB drives, memory cards

Example: A computer might use SRAM for L1/L2 cache, DRAM for main memory, and Flash memory for storage (SSD).

Practice Questions ✍️

  1. Compare RISC and CISC architectures in terms of instruction complexity, register usage, and memory access patterns.
  2. Explain how a data hazard occurs in instruction pipelining and describe two methods to resolve it.
  3. For a direct-mapped cache with 64 blocks and a block size of 16 bytes, where would the byte at memory address 1156 be mapped?
  4. Describe the process of translating a virtual address to a physical address in a paging system.
  5. Why is SRAM used for cache memory while DRAM is used for main memory?

5: Input-Output Organization

I/O Interfaces 🔌

Key Components:

  • I/O Ports: Connection points for external devices
  • I/O Buses: Data pathways between CPU and peripherals
    • PCI, PCIe, USB, SATA, etc.
  • I/O Controllers: Specialized circuits that manage specific devices
    • Handle timing, data formatting, buffering, error detection

Communication Methods:

  • Programmed I/O: CPU controls transfer, checking device status repeatedly
  • Interrupt-Driven I/O: Device signals CPU when ready
  • Direct Memory Access (DMA): Device transfers data without CPU involvement

Example: When you connect a USB flash drive, the USB controller detects it, signals the CPU via an interrupt, and then the operating system loads the appropriate driver to communicate with the device.

Memory-Mapped vs. I/O-Mapped I/O 🗺️

Memory-Mapped I/O:

  • I/O devices share address space with memory
  • Same instructions for memory and I/O operations
  • Flexible addressing capabilities
  • Examples: ARM, MIPS

I/O-Mapped I/O (Isolated I/O):

  • Separate address spaces for memory and I/O
  • Special instructions for I/O operations (e.g., IN, OUT)
  • Limited addressing range for I/O
  • Examples: x86 architecture

Example:

// Memory-mapped I/O
MOV R1, [0xFFFF0004]  // Read from I/O device at address 0xFFFF0004

// I/O-mapped I/O
IN R1, 0x60           // Read from I/O port 0x60 (keyboard controller in x86)
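
In C, memory-mapped I/O is usually written through a volatile pointer. A minimal sketch, assuming the same hypothetical device register at 0xFFFF0004 (volatile stops the compiler from caching or reordering the device accesses):

#include <stdint.h>

/* every access through this macro really touches the device */
#define DEVICE_REG (*(volatile uint32_t *)0xFFFF0004)

uint32_t read_device(void) {
    return DEVICE_REG;    /* a load from the device, not from RAM */
}

void write_device(uint32_t v) {
    DEVICE_REG = v;       /* a store that reaches the device */
}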

Interrupts and DMA ⚡

Interrupts:

  • Signals that cause CPU to pause current execution and handle an event
  • Types:
    • Hardware Interrupts: From external devices (keyboard, timer, etc.)
    • Software Interrupts: Generated by programs (syscalls, exceptions)
    • Maskable: Can be disabled
    • Non-maskable: Critical, cannot be disabled

Interrupt Handling Process:

  1. Device asserts interrupt signal
  2. CPU finishes current instruction
  3. CPU saves context (PC, flags, registers)
  4. CPU jumps to interrupt service routine (ISR)
  5. ISR handles the interrupt
  6. CPU restores context and resumes normal execution
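
Hardware interrupt handling can't be shown portably, but POSIX signals are a close software analog — here the OS saves and restores context around the handler much as a CPU does around an ISR (a sketch using the standard signal/alarm/pause calls):

#include <stdio.h>
#include <signal.h>
#include <unistd.h>

volatile sig_atomic_t got_interrupt = 0;

/* the "ISR": keep it short, set a flag, return */
void handler(int sig) {
    got_interrupt = 1;
}

int main(void) {
    signal(SIGALRM, handler);   /* register the service routine */
    alarm(1);                   /* the "device" fires in 1 second */

    while (!got_interrupt)
        pause();                /* suspend until an interrupt arrives */

    printf("interrupt handled, execution resumed\n");
    return 0;
}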

Direct Memory Access (DMA):

  • Allows devices to transfer data directly to/from memory without CPU involvement
  • CPU initiates transfer, then continues other tasks
  • DMA controller manages the transfer
  • Significantly reduces CPU overhead for large data transfers

Example: When printing a large document, instead of the CPU transferring each byte to the printer, it configures the DMA controller to transfer the entire document while the CPU handles other tasks.

Peripheral Devices 🖨️

Keyboards:

  • Matrix of switches scanned by controller
  • Generates scan codes for key press/release
  • Controller sends data serially to computer

Display Units:

  • CRT (Cathode Ray Tube): Electron beam scans phosphor screen
  • LCD (Liquid Crystal Display): Liquid crystals alter polarization of light
  • LED (Light Emitting Diode): Arrays of light-emitting diodes
  • OLED (Organic LED): Self-illuminating pixels, no backlight needed

Printers:

  • Impact: Physical contact with paper (dot matrix, daisy wheel)
  • Non-impact: No physical contact (laser, inkjet, thermal)
  • Connected via USB, network, or wireless interfaces

Example: A laser printer receives data via USB, stores it in buffer memory, then uses a laser to create a pattern on a photosensitive drum, which attracts toner that is transferred to paper and fused with heat.

6: Parallel and Advanced Architectures

Flynn's Classification 🧩

Based on instruction and data streams:

  • SISD (Single Instruction, Single Data):
    • Traditional uniprocessor
    • One instruction stream, one data stream
    • Example: Classic single-core CPU
  • SIMD (Single Instruction, Multiple Data):
    • Same operation performed on multiple data points simultaneously
    • Vector processors, GPU operations
    • Examples: SSE/AVX instructions, NVIDIA CUDA
  • MISD (Multiple Instruction, Single Data):
    • Different operations on the same data
    • Rare in practice
    • Example: Some fault-tolerant systems
  • MIMD (Multiple Instruction, Multiple Data):
    • Multiple processors executing different instructions on different data
    • Most common parallel architecture
    • Examples: Multicore CPUs, computer clusters

Example: In SIMD processing, a single instruction like "ADD" might be applied to 8 pairs of numbers simultaneously, making operations like video processing much faster.
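
A minimal C sketch of exactly this idea using the SSE intrinsics mentioned above (SSE is baseline on x86-64, so this compiles on any such machine; four float additions happen in one instruction):

#include <stdio.h>
#include <immintrin.h>

int main(void) {
    float a[4] = { 1, 2, 3, 4 };
    float b[4] = { 10, 20, 30, 40 };
    float c[4];

    /* one SSE ADD instruction operates on four float pairs at once */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));

    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);   /* 11 22 33 44 */
    return 0;
}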

Multicore Processors 🔄

Architecture:

  • Multiple CPU cores on a single chip
  • Shared or separate caches
  • Interconnection network between cores

Key Features:

  • Shared Memory: All cores access the same physical memory
  • Cache Coherency: Mechanisms to keep multiple caches consistent
  • Thread-Level Parallelism: Multiple threads execute simultaneously

Challenges:

  • Synchronization between cores
  • Load balancing
  • Memory contention
  • Power and heat management

Example: A quad-core processor can run four threads simultaneously. If an image processing application is designed to use multiple threads, it can process different parts of an image on different cores, completing the task much faster than a single-core system.
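
A minimal POSIX-threads sketch of this pattern in C — two threads each process half of an array in parallel (the doubling operation stands in for real image processing; compile with -pthread):

#include <stdio.h>
#include <pthread.h>

#define N 8

int data[N] = { 1, 2, 3, 4, 5, 6, 7, 8 };

/* each thread doubles one half of the array */
void *process_half(void *arg) {
    int start = *(int *)arg;
    for (int i = start; i < start + N / 2; i++)
        data[i] *= 2;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int lo = 0, hi = N / 2;

    pthread_create(&t1, NULL, process_half, &lo);
    pthread_create(&t2, NULL, process_half, &hi);
    pthread_join(t1, NULL);   /* wait for both halves to finish */
    pthread_join(t2, NULL);

    for (int i = 0; i < N; i++) printf("%d ", data[i]);
    printf("\n");
    return 0;
}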

GPU Architecture Basics 🎮

Graphics Processing Units (GPUs):

  • Specialized processors optimized for parallel operations
  • Thousands of simple cores vs. few complex cores in CPUs
  • Optimized for floating-point calculations and matrix operations

Architecture Components:

  • Streaming Multiprocessors (SMs): Groups of cores
  • CUDA Cores/Stream Processors: Individual processing units
  • Memory Hierarchy: Global memory, shared memory, registers
  • Scheduler: Manages work distribution

Applications:

  • Graphics rendering
  • Scientific computing
  • Machine learning/AI
  • Cryptocurrency mining

Example: Training a neural network on a GPU can be 10-100x faster than on a CPU because matrix multiplications can be highly parallelized across thousands of GPU cores.

Introduction to Quantum Computing ⚛️

Key Concepts:

  • Qubits: Quantum bits, can exist in superposition of states (both 0 and 1 simultaneously)
  • Superposition: Ability to represent multiple states at once
  • Entanglement: Qubits linked regardless of distance
  • Quantum Gates: Operations on qubits

Potential Advantages:

  • Exponentially faster for certain problems
  • Efficient factoring of large numbers (Shor's algorithm)
  • Quantum search (Grover's algorithm)
  • Quantum simulation

Challenges:

  • Decoherence and error correction
  • Scaling up number of qubits
  • Programming quantum algorithms

Example: Brute-forcing a 256-bit encryption key classically takes on the order of 2^256 operations; Grover's algorithm would cut that search to roughly 2^128 quantum operations, while Shor's algorithm could factor the large numbers underlying RSA in polynomial time, threatening current public-key cryptosystems.

7: Advanced Topics in Computer Organization

Pipelining and Superscalar Architectures 🚀

Pipelining (Review):

  • Overlapping instruction execution stages
  • Increases throughput but not individual instruction latency

Superscalar Architecture:

  • Multiple pipelines operating in parallel
  • Can issue and execute multiple instructions per cycle
  • Dynamic scheduling of instructions
  • Examples: Modern x86 and ARM processors

Challenges:

  • Instruction dependencies
  • Limited instruction-level parallelism in code
  • Complex scheduling logic

Example: A 4-way superscalar processor can potentially execute 4 instructions per clock cycle if they are independent, achieving up to 4 times the performance of a scalar processor at the same clock speed.

Vector Processors and Array Processing 📊

Vector Processors:

  • Optimized for operations on data arrays (vectors)
  • Single instruction operates on multiple data elements
  • Deep pipelines for vector operations
  • Examples: NEC SX series, historical Cray supercomputers

Array Processors:

  • Multiple processing elements arranged in an array
  • Each processor executes the same instruction on different data
  • Good for problems with regular data patterns
  • Examples: Systolic arrays, some AI accelerators

Applications:

  • Scientific simulations
  • Weather forecasting
  • Computational fluid dynamics
  • Signal processing

Example: A vector processor can add two 64-element vectors using a single vector addition instruction, whereas a scalar processor would need 64 separate addition instructions.

Performance Enhancement Techniques 📈

Branch Prediction:

  • Predicts direction of conditional branches
  • Static: Fixed rule or compiler hint (e.g., backward branches predicted taken)
  • Dynamic: Based on runtime behavior
  • Branch Target Buffer (BTB): Caches branch destinations
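
A toy simulation of the classic dynamic scheme, a 2-bit saturating counter (states 0-1 predict not taken, states 2-3 predict taken):

#include <stdio.h>

static int counter = 2;   /* start in "weakly taken" */

int predict(void) { return counter >= 2; }   /* 1 = predict taken */

void update(int taken) {
    if (taken  && counter < 3) counter++;    /* saturate at 3 */
    if (!taken && counter > 0) counter--;    /* saturate at 0 */
}

int main(void) {
    int actual[] = { 1, 1, 0, 1, 1, 1 };     /* observed branch outcomes */
    for (int i = 0; i < 6; i++) {
        printf("predicted %-10s actual %s\n",
               predict() ? "taken," : "not taken,",
               actual[i] ? "taken" : "not taken");
        update(actual[i]);   /* train the predictor on the real outcome */
    }
    return 0;
}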

Out-of-Order Execution:

  • Executes instructions based on data availability rather than program order
  • Uses techniques like:
    • Register Renaming: Eliminates false dependencies
    • Reservation Stations: Hold instructions waiting for operands
    • Reorder Buffer: Ensures results are committed in program order

Speculative Execution:

  • Executes instructions before knowing if they are needed
  • Results are committed only if speculation was correct
  • Discarded if incorrect (e.g., branch misprediction)

Example: When encountering a branch instruction, a processor might predict it will be taken based on past behavior, then speculatively execute instructions from the predicted path. If the prediction is correct, execution continues without delay; if wrong, the speculative work is discarded and execution restarts from the correct path.

Cloud Computing and Server Architectures ☁️

Server Architecture Components:

  • High-performance CPUs: Multiple sockets, many cores
  • Large memory capacity: Hundreds of GB to TB of RAM
  • High-bandwidth networking: 10/40/100 Gbps connections
  • Redundant power and cooling: For reliability

Cloud Computing Models:

  • Infrastructure as a Service (IaaS): Virtual machines, storage
  • Platform as a Service (PaaS): Development and deployment environments
  • Software as a Service (SaaS): Applications delivered over the internet

Virtualization:

  • Hypervisors: Software that creates and manages virtual machines
  • Containers: Lightweight virtualization without full OS
  • Resource management: CPU, memory, I/O allocation

Data Center Architecture:

  • Racks of servers connected by high-speed networks
  • Storage area networks (SANs)
  • Load balancers and firewalls
  • Power distribution and cooling systems

Example: A cloud provider like AWS might use thousands of multi-socket servers, each with 64+ CPU cores and 512GB+ RAM, arranged in racks and connected by high-speed networks. These physical resources are then partitioned using virtualization to provide virtual machines of various sizes to customers.

Practice Questions ✍️

  1. Compare and contrast memory-mapped I/O and I/O-mapped I/O approaches, including their advantages and use cases.
  2. Explain how DMA improves system performance compared to programmed I/O, and describe a scenario where it would be particularly beneficial.
  3. Based on Flynn's taxonomy, classify the following: (a) a standard laptop CPU, (b) a GPU running a shader program across multiple pixels, (c) a dual-processor server.
  4. How does branch prediction contribute to CPU performance, and what happens during a branch misprediction in a modern pipelined processor?
  5. Describe three key differences between traditional server architecture and cloud computing infrastructure.
