Processor

Basics of a CPU

Processors, or CPUs, are the “beating heart” of a computer and, as such, perform many more tasks than the ones discussed in this text. The focus here lies on the floating point calculation capabilities, the memory subsystems and the interfaces.

The traditional CPU (Central Processing Unit) is an Integrated Circuit (IC) that executes the logic, arithmetic, input/output (I/O) and control operations that are prescribed by software running on the computer. As time passed, many other subsystems of computers got integrated into the processor package (die), making the functions that the traditional CPU performs only a subset of all the functions that a modern processor performs. A “core” of a modern processor is a separate unit that performs all the tasks of a traditional CPU.

Modern processor cores are very diverse, complex and multi-faceted, varying wildly with ISA, microarchitecture, intended platform and manufacturer. A discussion of CPUs that covered all these variations would be impossible, necessitating a narrower scope. This discussion will try to be as generic as possible, but the focus lies on an Intel based server CPU of the “Skylake-SP” microarchitecture.

Intel dominates the PC/laptop as well as the HPC/supercomputer CPU market, making the restriction to an Intel x86-64 based CPU justified. Considering that almost all laptops and workstations contain CPUs that are (to a varying extent) derived from their server oriented counterparts, focusing the discussion around a server CPU seems logical as well. The Skylake-SP microarchitecture was chosen because it is very recent (at the time of writing), contains some very significant advancements for scientific computing workloads and is used in a cluster available to the MEFD group at the TU/e.

System on a Chip

Modern processors are best described by the “System on a Chip” (SoC) moniker, containing many of the core components of a computer. As such, most contain the following subsystems:

Instruction set architecture

An Instruction Set Architecture (ISA) is an abstract model of a computer and contains a collection of machine instruction definitions. Examples of common ISAs are x86-64, x86 and ARM, with x86-64 being the most common ISA for CPUs in servers, as well as consumer oriented computers.

An ISA is one of the most important aspects of a CPU, because it forms the link between software and hardware. ISAs were introduced to make programming easier: software could now be written in terms of ISA instructions instead of low level machine code. This made it possible to execute the same computer program on different computers, without any modification of the code.

An implementation of an ISA, called a microarchitecture (uarch), is the hardware based realization of these machine instruction definitions (disregarding microcode). Any specific uarch can also support extensions to its ISA, common examples are VT-d, AES-NI, SSE4 and AVX2. These extensions are additions to the abstract computer model of an ISA and contain specific instructions to accelerate certain tasks of a computer, like AES data encryption, virtualization and vector mathematics.

Some better known examples of microarchitectures are Intel i386, Intel Nehalem, Intel Haswell, AMD K8, AMD Bulldozer and AMD Zen.

Threads and cores

A thread is a chain of instructions that is to be executed by the CPU. Each thread is generated by a process, which can loosely be described as an instance of a computer program. A single core of a CPU can execute one thread at a time, but using a technique called time slicing and the concept of context switching, can handle multiple threads concurrently.

Multithreading (software) allows a single process to spawn a multitude of threads, dividing the workload of that process. Performance benefits (can) arise when these threads are executed in parallel on multiple cores of a CPU.
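
As a minimal sketch of software multithreading (the chunk division, the work_on_chunk function and the thread count are all illustrative choices, not taken from any particular application), the C++ snippet below splits the workload of one process over several threads:

    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    // Hypothetical per-chunk workload: each thread processes its own slice of the data.
    void work_on_chunk(std::vector<double>& data, std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            data[i] *= 2.0;  // placeholder computation
    }

    void process_in_parallel(std::vector<double>& data, unsigned num_threads) {
        std::vector<std::thread> threads;
        const std::size_t chunk = data.size() / num_threads;
        for (unsigned t = 0; t < num_threads; ++t) {
            std::size_t begin = t * chunk;
            std::size_t end = (t + 1 == num_threads) ? data.size() : begin + chunk;
            // Each std::thread can be scheduled on a different core by the OS.
            threads.emplace_back(work_on_chunk, std::ref(data), begin, end);
        }
        for (auto& th : threads)
            th.join();  // wait until every thread has finished its slice
    }

Whether this actually runs faster depends on the number of physical cores available and on the overhead of creating the threads.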

The specifications of a CPU may contain references to the number of threads it “has”, which should be interpreted as the maximum number of threads it can “execute” at the same time. The fact that a single CPU core can only execute one thread at a time doesn’t change, even if the CPU specifications state that it has more threads (typically twice as many) than cores. This has to do with a hardware based technique called simultaneous multithreading (SMT), which will be discussed later.

Cache

Quick access to data is critical for the performance of a CPU, making data flow and storage a major aspect of a CPU. The main memory of a computer has a relatively high latency and low bandwidth compared to the needs of modern CPU cores, which is where cache comes into play.

Cache is a storage subsystem of the CPU and acts like a data buffer. It contains, among other things, copies of the data that the CPU (or process) “predicts” it will access often or in the near future, reducing the loading time of that data. The amount of cache placed on the CPU is relatively small, typically about 1:1000, compared to the amount of main memory placed in a computer. A “cache-miss” refers to the situation where data is requested, but not stored in cache, resulting in a much longer loading time. Avoiding cache-misses is a large part of software (and hardware) optimization and can lead to very substantial performance improvements.
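
A classic illustration of this (a sketch; the matrix size N is arbitrary) is the traversal order of a matrix stored in row-major order. Walking along the rows follows the data as it is laid out in memory and mostly reuses cache lines that were already loaded, whereas walking along the columns jumps through memory with a large stride and triggers far more cache-misses, even though both functions perform exactly the same number of additions:

    #include <cstddef>
    #include <vector>

    const std::size_t N = 4096;  // arbitrary matrix dimension, row-major storage: (i, j) -> i * N + j

    double sum_row_major(const std::vector<double>& a) {
        double sum = 0.0;
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                sum += a[i * N + j];  // consecutive addresses: cache friendly
        return sum;
    }

    double sum_column_major(const std::vector<double>& a) {
        double sum = 0.0;
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t i = 0; i < N; ++i)
                sum += a[i * N + j];  // stride of N doubles: far more cache-misses
        return sum;
    }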

Cache memory pressure (memory nearly full) and the ever increasing speed of CPU cores led to the development of multiple levels of cache. This layered structure has the advantage that it can address both the memory pressure problem and the demand for faster data access, without resulting in prohibitive costs. The uppermost level of cache, L1, has gotten significantly faster over time, but did not increase much in capacity. The lowest level of cache, typically L3, saw the highest increase in capacity, but is also substantially slower than L1.

The development of multi-core processors and several levels of cache created an additional task for cache: inter-core data communication. Each core has its own private parts of L1 and L2 cache, whereas L3 cache is shared between the cores. This last level of cache is the place where data can be shared between the threads running on the different cores.

Cache memory is a lot faster, and generally superior on many fronts, compared to main memory, because cache is made from SRAM and main memory is made from DRAM. SRAM stands for “Static Random Access Memory”, whereas DRAM stands for “Dynamic Random Access Memory”. Each SRAM memory cell requires six transistors to store a bit, whereas DRAM requires only one transistor (and a small capacitor) per bit. The downside of DRAM is that the capacitors in DRAM memory need to be recharged frequently, causing delays and other problems. This constant refreshing of the stored data gave rise to the name “Dynamic”, while “Static” was used for SRAM, because it doesn’t need to be refreshed. The extra hardware complexity of SRAM allows it to be much faster than DRAM, but the additional transistors and the die space they occupy also make it much more expensive.

Execution units

Execution units of a CPU core are the parts that execute the machine instructions derived from the thread running on the core. There are many different types of execution units in modern CPU cores, each with their own specific function. Notable examples of execution units are: the arithmetic logic unit (ALU), the address generation unit (AGU) and the floating-point unit (FPU). Discussing the functions and operations of all these execution units is beyond the scope of this text, which will focus on the floating-point execution unit.

Superscalar

CPU cores have many execution units, and most also have multiple execution units of the same type. Keeping all the execution units busy at the same time requires multiple instructions to be dispatched (one instruction per execution unit) simultaneously. The ability of a CPU core to dispatch multiple instructions simultaneously is called being superscalar, whereas CPUs that can only dispatch a single instruction are called scalar. Superscalar capabilities are a form of instruction-level parallelism.

SIMD

SIMD stands for single instruction, multiple data and is a form of data level parallelism. SIMD is a vector processing technique, allowing an execution unit to perform the same instruction on multiple data entries (grouped in 1D arrays called vectors) in a single clock cycle. The maximum achievable throughput increases substantially by using SIMD, but it does require all the data to be manipulated in the same way, making it less versatile.

SIMD works by exposing deep registers, containing the data vectors, to the execution units. The instruction that the execution unit receives is performed on the complete register.
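
As a hedged sketch of this mechanism on an x86-64 CPU, the AVX intrinsics below (from the immintrin.h header; the array length n is assumed to be a multiple of four) load four double precision values into one 256 bit register and add them with a single instruction:

    #include <cstddef>
    #include <immintrin.h>

    // c[i] = a[i] + b[i], processing four doubles per instruction.
    void add_avx(const double* a, const double* b, double* c, std::size_t n) {
        for (std::size_t i = 0; i < n; i += 4) {
            __m256d va = _mm256_loadu_pd(a + i);  // load 4 doubles into a 256 bit register
            __m256d vb = _mm256_loadu_pd(b + i);
            __m256d vc = _mm256_add_pd(va, vb);   // one instruction performs 4 additions
            _mm256_storeu_pd(c + i, vc);          // store the complete register back to memory
        }
    }

Compilers can often generate this kind of code automatically (auto-vectorization) from a plain scalar loop.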

FMA3 and advanced vector extensions

The floating-point execution units found in modern Intel CPU cores are based on FMA3 (from Haswell onwards). These Fused Multiply Add based units are capable of three different operations: a fused multiply-add (a x b + c), a multiplication and an addition.

The Skylake-SP uarch contains two AVX-512 execution units, with 512 bit deep registers. Each AVX-512 unit contains 8 FMA3 sub-units for “double” floating-point numbers and 16 FMA3 sub-units for “single” floating-point numbers. These AVX-512 execution units form the hardware layer of the AVX-512 ISA extension.

The Haswell and Broadwell uarch contain two AVX2 execution units, with 256 bit deep registers. Each AVX2 unit contains 4 FMA3 sub-units for “double” floating-point numbers and 8 FMA3 sub-units for “single” floating-point numbers. These AVX2 execution units form the hardware layer of the AVX2 ISA extension.
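
A minimal sketch of what such an FMA3 instruction does, written with AVX-512 intrinsics (this requires a CPU and compiler with AVX-512F support; n is assumed to be a multiple of eight):

    #include <cstddef>
    #include <immintrin.h>

    // y[i] = a[i] * x[i] + y[i], eight doubles per fused multiply-add instruction.
    void fma_elementwise(const double* a, const double* x, double* y, std::size_t n) {
        for (std::size_t i = 0; i < n; i += 8) {
            __m512d va = _mm512_loadu_pd(a + i);
            __m512d vx = _mm512_loadu_pd(x + i);
            __m512d vy = _mm512_loadu_pd(y + i);
            vy = _mm512_fmadd_pd(va, vx, vy);   // fused multiply-add: va * vx + vy in one instruction
            _mm512_storeu_pd(y + i, vy);
        }
    }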

AMD’s latest (at the time of writing) Zen architecture also supports AVX2 instructions (not AVX-512), but the hardware based implementation is completely different. It doesn’t have native support for 256 bit deep registers and each AVX2 instruction takes two clock cycles to complete, compared to a single clock cycle on Intel’s AVX2 capable CPUs.

Simultaneous multithreading

Simultaneous multithreading (SMT), also known as Hyper-Threading, is a technique aimed at increasing the utilization of a CPU core. Each physical CPU core presents itself as multiple logical CPU cores to the Operating System (OS). Every logical CPU core gets its own thread assignment from the OS, meaning that a single CPU core is tasked with multiple threads. The number of logical cores per physical core differs from microarchitecture to microarchitecture. The most common is two logical cores per physical core, but Intel Xeon Phi and IBM POWER9 based designs have four and up to eight logical cores per physical core, respectively.

When a thread (assigned to a CPU core) is unable to assign tasks to every execution unit of the CPU core, instructions of a different thread assigned to the same CPU core can be dispatched to the unused execution units. One of the most common culprits behind a thread underutilizing the execution units is cache-misses. SMT can be very helpful in hiding the latency caused by data requests that require multiple clock cycles to fulfill.

The performance improvements generated by SMT vary wildly with the application, from a factor of two down to a performance decrease. SMT has the largest positive effect in situations where cache-misses are frequent, instruction level parallelism per thread is low and the workloads of the threads are very heterogeneous. Math routines provided by highly optimized libraries, such as the Intel Math Kernel Library (MKL), generally don’t fall into this category, making SMT less beneficial for HPC purposes.

Operating frequency

The (high level) building blocks of digital circuits are called logic gates, which are hardware implementations of boolean operations. These logic gates are combined in intricate ways to provide more high level functionality, like the FMA operation on floating-point data. Synchronization of the logic gates is key for their operation, which is where the clock signal comes into play.

The clock signal of the CPU core is a square wave signal, switching between high and low (logical “on” and “off”). The rising and falling edges of this square wave are the “timing” signals for the logic gates to evaluate their input. The frequency of this timing signal is called the operating frequency or clock frequency of the CPU core. The operating frequency of a CPU core is directly linked to the throughput of micro-operations, making it a key factor in the performance of a CPU.

Turbo frequencies

Almost all modern CPUs employ some sort of dynamic operating frequency control, allowing the operating frequency of various parts of the CPU to go up or down in conjunction with demand and thermal headroom. Dynamic scaling of the operating frequencies took off when mobile devices became more popular, requiring momentary high performance combined with long battery life. The basic idea behind these techniques is that a relatively high operating frequency can be achieved for a short duration of time. This increases performance for workloads that can be completed within the time frame of the elevated operating frequency, but doesn’t significantly increase the overall heat production and power consumption of the CPU. The turbo boost technology of modern CPUs is too complex to discuss in great detail in this text, but a few important aspects pertaining to floating point performance will be explained.

Specifications of the turbo frequencies are very important to the performance of a CPU, but are often reduced to a single number, masking the complete story for marketing purposes. Turbo frequencies scale down according to the workload type (normal, AVX2 and AVX-512) and the number of active cores. AVX-512 workloads produce the most heat and the highest power consumption because their execution units contain the most transistors; AVX2 execution units require less power and normal (non floating-point) workloads even less than that.

A more detailed specification of the turbo frequencies of an Intel Xeon Gold 6132 CPU serves as an example. Intel ARK lists the Xeon Gold 6132 with a base frequency of 2.6 GHz and a maximum turbo boost of 3.7 GHz, while WikiChip provides more detailed frequency information. The highest floating-point performance of the Xeon Gold 6132 is achieved when all cores are utilizing their AVX-512 execution units. The maximum turbo frequency for this workload is 2.3 GHz, which is almost 40% lower than the maximum frequency (3.7 GHz) provided by Intel ARK.
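
Combining these numbers gives a useful back-of-the-envelope estimate of the theoretical peak double precision performance of the Xeon Gold 6132 (a rough sketch: 28 AVX-512 FMA units in total, 8 double precision lanes per unit, each fused multiply-add counted as two floating-point operations, at the 2.3 GHz all-core AVX-512 turbo):

    28 x 8 x 2 x 2.3 GHz ≈ 1030 Gflops ≈ 1.03 Tflops

Using the advertised 3.7 GHz instead would overestimate the achievable peak by roughly 60%.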

CPU interfaces

As stated earlier, modern CPUs are more akin to SoCs than traditional processors. One of the most significant advancements has been the integration of interfaces onto the CPU package. This allows for higher bandwidth and lower latencies because fewer signals have to pass over the PCB traces of the motherboard.

Memory controller

The memory controller is the connecting element between the CPU cores and the DRAM memory. Many scientific computing workloads have data sets that don’t fit in the cache of the CPU, forcing extensive usage of DRAM memory. These kinds of workloads usually benefit heavily from high bandwidth and low latency main memory, making the memory controller crucial for performance. Modern memory controllers are very complex pieces of engineering and explaining their operation is well beyond the scope of this text, which will focus on their features instead.

One of the most distinguishing aspects of a memory controller is the number of memory channels it supports. A memory channel is a 64-bit wide interface to a cluster of DRAM chips, usually located on a DIMM. The peak bandwidth can be increased by allowing parallel access to multiple memory channels, giving dual channel memory twice the peak bandwidth of single channel memory, while latency remains unaffected. Typical consumer PC and laptop CPUs have an integrated memory controller capable of dual channel operation, whereas the Intel Skylake-SP chips contain two memory controllers, each supporting three channels, for an effective six channel memory system.

A second important aspect of a memory controller/system is support for Error Correcting Code (ECC), which is a technique to detect and correct certain memory errors. The information stored in a memory cell can get corrupted by a faulty power supply or by interaction with cosmic radiation, resulting in a “bit flip”. ECC capable memory stores additional bits of parity data to detect and (when possible) repair these corruptions. Bit flips are not very common and most consumer applications don’t suffer terribly when they encounter one (the system may crash), but bit flips in sensitive, long running and expensive simulations are much more problematic. That is why ECC is almost always employed in servers, despite its drawbacks (higher cost and latency).

PCI-express

PCI-express (Peripheral Component Interconnect Express) is the dominant interconnect interface for computer components. Basically everything other than DRAM is connected to the CPU via (a derived form of) PCIe (PCI-express), making it a very important element of a computer. Some examples of components that are connected to the CPU via PCIe are: GPUs, SAS storage controllers, NVMe storage devices and network interface cards (NICs).

PCIe is a serial bus, introduced to replace PCI(-X) and AGP. The PCIe standard has seen multiple (backwards compatible) revisions over the years since its introduction, improving (among other things) bandwidth, power delivery, encoding overhead and features. The most common implementation of the standard (at the time of writing) is version 3.x, which is the version that this text considers.

PCIe links consist of “lanes”, each lane having four physical connections: two for a differential receive signal and two for a differential transmit signal, amounting to a full-duplex connection. A PCIe link to a device may be a grouping of multiple lanes, ranging from one to 32 lanes per link (x1, x2, x4, x8, x16, x32). GPUs are commonly connected using a x16 link, SAS controllers and NVMe storage typically use a x4/x8 link and single port NICs have a x1 link.

The bandwidth of a PCIe v3 x1 link is specified in Giga Transfers per second (GT/s), the number of bits per second that can be transferred from the host to the client or vice versa. The PCIe v3 standard uses 128b/130b line encoding, meaning that for every 130 bits transmitted, only 128 bits contain data and the remaining two bits form a synchronization header. This means that a PCIe v3 x1 link of 8 GT/s has a bandwidth of 985 MB/s (8000 x (128/130) x (1/8) = 984.62) and a x16 link has a bandwidth of 15.75 GB/s, in each direction.

DMI

The Direct Media Interface (DMI) interconnect is an Intel specific protocol used to connect the CPU to the Platform Controller Hub (PCH), which is (among other things) responsible for USB and SATA connectivity. DMI is a prime example of an interconnect specification derived from PCIe, with a DMI 3.0 link being nearly equivalent to PCIe v3 x4 link.

QPI and UPI

Most high-end servers allow the placement of multiple identical CPUs on the same motherboard, typically putting two or four CPUs in the same computer. Intel QuickPath Interconnect (QPI) and its successor, Intel UltraPath Interconnect (UPI), are interfaces primarily used for inter CPU communication in these multi socket machines.

The bandwidth of the connection between the CPUs can be important in a number of scenarios, for example:

The total bandwidth of a QPI/UPI connection is its transfer speed specification times four: two bytes per transfer, in each of the two directions simultaneously. The Intel Xeon Gold 6132 has a UPI link speed of 10.4 GT/s, amounting to 41.6 GB/s of bandwidth (10.4 x 2 x 2 = 41.6). Note that this is considerably less than the maximum memory bandwidth of the Xeon Gold 6132, which is 119.21 GiB/s.

Graphics card

Basics of a GPU

GPU (Graphics Processing Unit), graphics card and video card are all names for the piece of hardware inside a computer that is responsible for creating the image on the display of a PC. GPUs can be divided into two groups: dedicated and integrated. Integrated GPUs are part of the CPU SoC, whereas dedicated GPUs are separate circuits, often housed on their own PCBs. These separate PCBs are usually referred to as “cards”, which is where the names graphics card and video card stem from.

It should be noted that not all computers have a GPU. Servers and embedded systems are often “headless”, which means without a display. They do not explicitly require a GPU, and as such sometimes don’t have one. Interfacing with these systems is often possible via a serial connection. However, not having a GPU is very uncommon these days, even for headless servers.

As PCs grew more capable, the tasks placed on GPUs also grew. The traditional GPU was not much more than a framebuffer, but the introduction of graphical user interfaces and video games gave rise to 2D and 3D requirements. Filling the framebuffer with data used to be a task of the CPU, but graphics accelerators took over most of the common graphics drawing commands from the CPU. Accelerators for 2D, accelerators for 3D and the framebuffer were combined into a single device now known as a GPU.

Video gaming evolved into a huge industry, requiring ever more powerful GPUs to sustain the demand for graphical compute workloads. The hardware became capable and versatile enough to take over additional tasks from the CPU, most notably media encoding and decoding. At this point, the pattern of the GPU taking over tasks from the CPU was well established. Anticipating that this trend would only continue, NVIDIA decided that programmers should get access to the compute capabilities of GPUs in a less graphics centered manner. This idea came to fruition in the form of the Compute Unified Device Architecture, or CUDA. Other GPU manufacturers followed suit and often provide an architecture similar to CUDA.

System on a Chip

Like modern CPUs, GPUs can also be described as systems on a chip, or SoC. They contain many of the same subsystems as modern CPUs:

Their functions don’t differ that much from their CPU counterparts, other than the fact that they are optimized for graphics related workloads and the fact that the cores are not of the general purpose kind. The features that GPU cores lack compared to CPU cores make it impossible for a GPU to perform certain tasks, like running an operating system.

Massively parallel

GPUs have a lot in common with CPUs, but the very specific workloads envisioned for GPUs caused them to be very distinctive in one aspect: they are massively parallel in nature. A typical high end CPU like the Intel Xeon Gold 6132 has a total of 28 floating point execution units, whereas an NVIDIA V100 GPU has 320 floating point execution units.

NVIDIA

There are many vendors of GPUs (AMD, NVIDIA, Intel, Imagination Technologies), most of which have products capable of some sort of compute support via OpenCL. OpenCL stands for Open Computing Language and is an open, royalty-free and portable framework for compute on CPUs and GPUs. The main competitor of OpenCL is CUDA, which is proprietary to NVIDIA.

CUDA is somewhat older than OpenCL and had a mature implementation before OpenCL, allowing CUDA to develop a head start with respect to the development of a GPGPU compute ecosystem. Combined with NVIDIA’s very large share of the dedicated graphics card market, this results in CUDA being the dominant force in the GPGPU compute world. The most obvious disadvantage of this situation is vendor lock-in in NVIDIA’s favor.

Many of the topics that will be discussed are manufacturer agnostic, but certain aspects like hardware design specifics and programming models are manufacturer specific. This text will try to be as generic as possible, but uses NVIDIA products as a baseline.

ISA and IR

The instruction set architecture of a CPU is one of its most defining aspects. The same goes for GPUs, but the effect on the end user is much less severe.

The concept of an ISA was introduced to aid in the quest for code portability. Software was no longer written in machine specific code, but using instructions provided by the ISA. This made it possible for software to run on physically different hardware, as long as it conforms to the same ISA. The ISA concept was a huge step towards code portability, but full code portability was not yet achieved. A C++ program compiled for an x86 based computer will not run on an ARM based smartphone, because they adhere to different ISAs.

The next step towards achieving code portability is the process virtual machine, sometimes referred to as a runtime environment. The best known example of a process virtual machine is the Java Virtual Machine, also known as the JVM or JRE. A process virtual machine is a piece of software that acts as a translation layer between the OS/ISA and the process virtual machine target code. All applications that are written (exclusively) in the virtual machine target code can run correctly on all platforms that have the process virtual machine available. This is why pure Python code can run both on an x86 based computer and on an ARM based Android smartphone.

Code portability is an important concern for GPUs as well. GPUs from NVIDIA have a different ISA than GPUs from AMD, but users still expect the applications they run on their computer to function correctly, irrespective of the brand of GPU in their system. Achieving this goal is a complex and multifaceted task, but a large part of the solution is the driver of the GPU.

A GPU driver is at some level very comparable to a process virtual machine. It is a piece of software that acts as a translation layer between the hardware of the GPU and the code it receives. When this code is intended to manipulate the image on the screen, it is typically based on a graphics API like OpenGL, Vulkan or DirectX. GPGPU applications typically make use of CUDA, OpenCL or other APIs capable of GPU offloading, like modern versions of OpenMP and OpenACC.

Because the GPU driver needs to support a plethora of APIs, GPU manufacturers generally resort to a solution that contains an intermediate representation (IR). This means that the driver contains various software libraries that map the commands of an API to the IR and a piece of software that translates the IR to machine instructions. This IR is the closest thing to an ISA that GPU manufacturers provide, with the commonality that code written in this IR language will run on all systems that contain the IR to machine code translator, just as code compiled for a specific ISA will run on all the hardware that conforms to this ISA.

Microarchitecture and extensions

Microarchitectures in the GPU sense are slightly different from microarchitectures in the CPU sense, because they are not really implementations of an ISA. Nevertheless, the various generations of GPU chips are categorized as different microarchitectures.

Extending the functionality of a CPU involves ISA extensions, like AVX2. Extending the functionality of a GPU is achieved somewhat differently, because backwards compatibility can largely be provided by the IR to machine instructions translator software. However, exposing new hardware functionality to the user sometimes requires additions to the IR language. These additions come in the form of new versions of the IR, which are generally backwards compatible. This means that code written in IR language version X will also execute on a system that provides IR language version Y, as long as X ≤ Y.

NVIDIA Volta

Maintaining backwards compatibility in software is a lot more flexible than in hardware, which is one of the reasons that succeeding microarchitectures of GPUs can vary rather wildly in design. This makes it prudent to zoom in on a specific microarchitecture, namely NVIDIA Volta.

The NVIDIA Volta microarchitecture was developed specifically for GPGPU purposes, and as such is (almost) exclusively used by the various incarnations of the Tesla V100 card. It is capable of unrestricted half and double precision floating point operations, has special “tensor” cores and very high memory bandwidth. These features are unavailable on most consumer grade NVIDIA GPUs (at the time of writing) and are very important to achieving high performance in various GPGPU applications.

Programming model

The programming model of a GPU has a more layered structure than the programming model of a traditional CPU based application. The mostly embarrassingly parallel workloads intended for the GPU gave rise to a programming model that was designed from the ground up to cater to the needs of splitting up an application into many independent pieces. Some of the nomenclature found in the programming model for GPUs is manufacturer specific, but the underlying concepts are usually present in the implementations of all the manufacturers of GPUs. This text will adhere to the names as they are presented in the documentation of NVIDIA PTX, but the names of the equivalent concepts within OpenCL will be provided for convenience. Furthermore, it is assumed that the code constructed by the programmer is in the CUDA language, not in the intermediate representation.

Host and device

One of the most important aspects of the GPU programming model is the distinction between host and device.

GPGPU applications basically contain two parts: sections that run on the host (CPU) and sections that run on the device (GPU). This is important because each has its own exclusive memory pool. The GPU is unable to execute instructions on data that is stored in host memory (main memory) and the CPU is unable to execute instructions on data that is stored in device memory (VRAM). This necessitates transferring data between host and device, which are (generally) connected via the PCI-e interface.

Transferring data to and from the device is a time consuming operation and can easily become a bottleneck for the performance of the application. In an effort to reduce the data transfers to an absolute minimum, management of the data location has been left to the programmer, so that it can be tailored to the requirements of the algorithm of the GPGPU application.
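
A minimal host-side sketch of this explicit data management with the CUDA runtime API (error checking is omitted and the buffer size is arbitrary):

    #include <cstddef>
    #include <cuda_runtime.h>
    #include <vector>

    void transfer_example() {
        const std::size_t n = 1 << 20;       // arbitrary element count
        std::vector<double> host(n, 1.0);    // data lives in host (CPU) memory

        double* device = nullptr;
        cudaMalloc((void**)&device, n * sizeof(double));                              // allocate device (GPU) memory
        cudaMemcpy(device, host.data(), n * sizeof(double), cudaMemcpyHostToDevice);  // host -> device over PCI-e/NVLink
        // ... kernel launches operating on "device" would go here ...
        cudaMemcpy(host.data(), device, n * sizeof(double), cudaMemcpyDeviceToHost);  // device -> host
        cudaFree(device);
    }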

OpenCL: “host device” is equivalent to “host” and “compute device” is equivalent to “device”.

SPMD and kernel

Single program, multiple data (SPMD) is the abstraction level that forms the base of the GPU programming model. Comparing SPMD to the other members of Flynn’s taxonomy shows that the low level concept of an instruction has been replaced by the high level abstraction of a program. This makes it both more accessible to the uninitiated and applicable to a wider range of situations. The disadvantage is that the additional abstraction creates a greater distance between the concept and the implementation.

SPMD is a technique that is centered around the idea that a relatively simple and data independent program needs to be applied to a large quantity of small data sets. This program, which is called a kernel in a GPGPU setting, is executed in parallel on the small data sets, reducing the overall runtime compared to a sequential approach.
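
A hedged sketch of such a kernel in CUDA: every thread runs the same program, but applies it to its own element of the array (the scaling operation is an arbitrary placeholder; the blockIdx/blockDim/threadIdx identifiers used to select that element are discussed in the thread hierarchy sections below).

    // The same "program" for every thread, each working on its own small piece of data.
    __global__ void scale_kernel(double* data, double factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique element index for this thread
        if (i < n)
            data[i] *= factor;
    }

Launching it as scale_kernel<<<(n + 255) / 256, 256>>>(device_data, 2.0, n); creates one thread per element, grouped in blocks of 256 threads.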

SIMT

A kernel consists of a chain of instructions, which typically contain many floating point operations if they are to be executed on the GPU. Executing these instructions efficiently on hardware with many cores and deep registers, like GPUs, relies heavily on both instruction level parallelism and thread level parallelism. Combining these two forms of parallelism results in the MIMD architecture as defined by Flynn’s taxonomy. The problem with MIMD is that the two forms of parallelism encapsulated in MIMD require different programming techniques to utilize, which is undesirable.

SIMT stands for “single instruction, multiple threads” and has been introduced by NVIDIA. It aims to provide a single execution model on hardware that conforms to the MIMD architecture, requiring only one programming technique to utilize. The effort of dividing the workload amongst the different cores and registers of the execution units is less a responsibility of the programmer and more one of the toolchain. GPGPU programming using CUDA relies on SIMT, where the programmer can control various aspects using concepts like threads, warps, blocks and grids. This allows the programmer to utilize the hardware in the most effective manner, without having to resort to explicit control over registers. However, it remains important to understand that the SIMT and latency hiding techniques provided by CUDA are basically abstractions of the MIMD architecture and SMT.

Threads and warps

A thread in the traditional CPU context is a chain of instructions that can operate on one or more data streams, where multiple data streams are processed in parallel using the SIMD mechanism. The instructions in the thread represent the workload of the execution units in the CPU, meaning that an instruction contains a couple of items: the operation that needs to be performed and the input and output data locations for all the data streams that the execution unit processes.

A SIMT thread is different because it is more of an abstract concept, as it contains incomplete instructions: the operation that needs to be performed and the input and output data locations for only a single data stream.

This incomplete set of input and output data locations represents a single data stream of a SIMD capable execution unit, which is why a SIMT thread is sometimes referred to as a SIMD lane instruction stream. A complete instruction would contain the input and output data locations of all the data streams that a SIMD capable execution unit processes.

The advantage of SIMT threads is that they are relatively intuitive, because they represent the workings of the compute kernel at the data stream level. However, they have the disadvantage that they do not have a simple mapping to the hardware that is supposed to execute them. This mapping of the threads to the real hardware instructions mostly comes down to grouping them, such that a single group contains threads with compatible incomplete instruction sequences. This grouping of threads serves to fill the deep registers of the execution units and a single group of threads is called a warp.

The toolchain is responsible for grouping the threads into warps, which reduces the workload of the programmer. The downside of this loss of explicit control is that threads that contain data dependent diverging control flow patterns, like “if else” blocks, might be grouped together in a single warp. This means that various subsections of the same warp have to perform different instructions, which is against the intended operation of warps. This problem is circumvented by applying a mask to the threads of a warp, defining whether a thread is active or not. This ensures that all the active threads of the warp perform the same instruction, with the various differing subsections of the warp masked consecutively.
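
A hedged sketch of such a data dependent divergence (the branch condition is arbitrary): the even and odd threads of a warp take different branches, so the hardware executes the two branches one after the other, each time with the non-participating threads of the warp masked off.

    __global__ void divergent_kernel(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (i % 2 == 0)
            data[i] *= 2.0f;  // even lanes active, odd lanes of the warp masked off
        else
            data[i] += 1.0f;  // odd lanes active, even lanes masked off; both halves run serially
    }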

OpenCL: “work item” is equivalent to “thread” and “wavefront” is equivalent to “warp”.

Thread hierarchy

GPGPU applications can generate massive amounts of threads, making it imperative to have some sort of hierarchy to these threads. The hierarchy that the GPU programming model provides aims to aid in two tasks: the design of the algorithm and the allocation of hardware resources.

One of the most important features available to the algorithm design process is thread identification. Each thread has a unique (local) identifier in the form of a tuple of (non-negative) integer values, with a single tuple containing at most three elements. The number of elements that a tuple contains can be controlled by the programmer and represents the “dimension”, mimicking the dimensions encountered in the underlying physics or mathematics problem that the algorithm tries to model. Even though this thread identifier appears to be very similar to a coordinate system, it differs significantly because the threads are not an ordered set. In a one dimensional setting, this would mean that the thread with identifier “(1)” does not need to be adjacent to thread “(2)”, making it impossible to rely on an ordering in the algorithm design.

Hardware resource allocation is the second major aspect of the thread hierarchy, which is where blocks and grids come into the equation. Their main purpose is to allocate threads to the hardware constrained partitioning of the device memory pool.

Blocks and grids

Grids are the highest level of the thread hierarchy, and as such represent the highest level of memory pool partitioning. The massively parallel nature of GPGPU applications allows many of them to efficiently employ multiple GPUs, which requires a method to divide the workload. Grids fill this need by assigning the threads generated by the kernel to a grid, one for each GPU (memory pool) in the system.

Blocks, or cooperative thread arrays (CTA) as the PTX documentation calls them, represent the second level of thread hierarchy and serve to divide the resources of a single GPU. Blocks containing threads are assigned (at runtime) to an available resource slot, which corresponds to an idle GPU core. If multiple blocks are assigned to the same core, they are executed using a time slicing strategy.

Each core of a GPU has a modest amount of local memory, which can be accessed by the threads running on that particular core, but is inaccessible to threads that run on a different core. This shared memory can, among other things, be used to communicate between threads. This local communication path is a lot faster than communication via the (global) GPU memory, making it paramount to assign threads that need to communicate with each other to the same block.
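
A minimal sketch of this intra-block communication path (block size and data are arbitrary): each thread stages a value in the shared memory of its block, the block synchronizes, and a neighbouring thread of the same block then reads that value.

    #define BLOCK_SIZE 256  // arbitrary block size for the example

    __global__ void neighbour_exchange(const float* in, float* out, int n) {
        __shared__ float tile[BLOCK_SIZE];  // visible only to the threads of this block

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            tile[threadIdx.x] = in[i];      // every thread stages its own element

        __syncthreads();                    // make the staged data visible to the whole block

        if (i < n && threadIdx.x > 0)
            out[i] = tile[threadIdx.x - 1]; // read a value written by a neighbouring thread
    }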

The threads contained in a grid are assigned to a number of blocks. Each block has a unique identifier, which, like that of a thread, consists of a tuple of (non-negative) integer values with at most three elements. This means that the global thread identification contains at most seven values: one for the grid identifier, at most three for the block identifier and at most three for the local thread identifier. The division of threads among the blocks of a single grid is not a responsibility of the programmer, as he/she only has implicit control.
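
A hedged sketch of how these identifiers appear in CUDA code, using two dimensional blocks and grids (all sizes are arbitrary): the global position of a thread is reconstructed from its block identifier, the block dimensions and its local thread identifier.

    __global__ void fill_2d(float* a, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column from block id + local thread id
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row from block id + local thread id
        if (x < width && y < height)
            a[y * width + x] = 1.0f;
    }

    void launch_fill(float* device_a, int width, int height) {
        dim3 block(16, 16);                             // local thread identifiers: tuples of two elements
        dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);    // block identifiers: tuples of two elements
        fill_2d<<<grid, block>>>(device_a, width, height);
    }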

OpenCL: “work group” is equivalent to “block” and “NDRange” is equivalent to “grid”.

Streaming Multiprocessors

NVIDIA GPU cores are called streaming multiprocessors, or SMs for short. They represent the lowest hardware level of a GPU that can operate independently, which makes them very comparable to a core of a multi-core CPU. They also form the basis of product differentiation, as the computational power of a GPU can be increased or decreased by adding or removing SMs to/from the design. GPUs aimed at the low-end or mobile segment of the market will typically employ 2 to 4 SM units in their chip, whereas chips aimed at the high-end GPGPU market may contain about 80 streaming multiprocessors.

As stated earlier, GPU microarchitectures can vary significantly, resulting in vastly different specifications for streaming multiprocessors from different generations. This necessitates limiting the scope of the discussion somewhat to the SM design from the Volta generation, referring to the whitepapers of different microarchitectures for the details of their SM design.

Superscalar and pipelining

CPUs are said to be superscalar if they can process more than one instruction per clock tick, which is usually achieved by instruction level parallelism techniques and multiple execution units per core. GPU cores also have superscalar capabilities, but rely more on thread (warp) level parallelism to accomplish this. Each SM from the Volta generation is divided into four processing blocks, each of which has its own instruction dispatcher and set of execution units. This means that a single SM can process up to four warps at the same time, compared to a single thread (disregarding SMT) for CPU cores.

The Volta SM uses pipelining to achieve two goals; increased execution unit utilization and concurrent execution unit operation. Pipelining is a technique where each instruction is divided into multiple distinct sub-instructions (uops) that need to be executed sequentially. Every sub-instruction is processed by a different element of the execution unit, making it possible to increase the utilization of an execution unit by scheduling multiple (independent) sub-instructions to the execution unit at the same clock-tick. The concurrent execution unit operation is achieved by the ability to (sequentially) issue uops for different execution units, before the result of previous uops is available.

Separate but related to pipelining are the various techniques available to manipulate the pipeline, which is the queue of uops waiting to be executed. Concepts like speculative execution and out-of-order execution are available to increase execution unit utilization and are commonplace in CPU designs, but generally lacking in GPU designs. The main reason that GPUs don’t employ those kinds of techniques is that they require a lot of silicon real estate and provide the most benefits in situations where code branching is common. Typical GPU workloads don’t have a lot of code branching, which makes the silicon real estate better spent on additional execution units. Many external references will state that GPUs don’t support pipelining, when they actually mean that they don’t support (some of) the pipeline manipulation techniques.

SMT and latency

GPUs are massively parallel, high throughput and high latency devices. The first two properties are generally considered positive, but high latency has absolutely no benefits in any situation whatsoever. As a result, a lot of the effort involved in GPU design is centered around latency hiding techniques. CPUs aim to reduce latency with extensive caching algorithms, hoping that the data is in fast cache when it is required. GPU designs optimize for throughput rather than latency, and as such will always have to deal with significant latency. Their approach only hides the effects of latency, without actually reducing it.

The problem of memory latency is twofold. Firstly, a task takes longer to complete if it needs to wait on data. Secondly, hardware resources like execution units are idle when they are starved by a lack of data, causing suboptimal utilization. The first aspect is only really problematic when subsequent tasks are dependent on the completion of the first task. Otherwise, performing different tasks while waiting for the result of the first task still results in high overall throughput. However, the suggested solution to the first problem is contingent on the possibility of employing hardware resources that would otherwise be unavailable. As it happens, the second aspect of latency implies that hardware resources are idle when waiting for data. This means that the problem is solved when a mechanism is in place that can assign the idle hardware to different tasks while the initial task is waiting for data, having effectively hidden the memory latency.

Context switching, provided by simultaneous multithreading (SMT), is exactly such a mechanism and provides the basis for most of the latency hiding techniques found on GPUs. The operation of SMT is not inherently different on a GPU compared to a CPU, but it is employed on a much grander scale. Typical consumer CPUs will provide SMT that makes fast context switching between two threads possible, whereas NVIDIA GPU cores can have as many as 64 warps in flight at the same time. Increasing the number of tracked contexts from two to 64 dramatically improves the odds that at least one of the contexts is not waiting for data, enhancing SMT’s capabilities as a latency hiding technique.
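
As a hedged illustration of this “many contexts in flight” strategy, the CUDA runtime can report how many blocks of a given kernel can be resident on one streaming multiprocessor at the same time; the kernel and block size below are placeholders.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummy_kernel(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = data[i] * 2.0f + 1.0f;  // placeholder work
    }

    void report_resident_warps() {
        const int block_size = 256;           // 256 threads per block = 8 warps per block
        int blocks_per_sm = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, dummy_kernel, block_size, 0);
        // Warps the SM can keep in flight for this kernel, i.e. contexts available for latency hiding.
        printf("resident warps per SM: %d\n", blocks_per_sm * block_size / 32);
    }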

CUDA cores and execution units

Marketing documentation and microarchitecture whitepapers provided by NVIDIA often state the number of CUDA cores that a particular GPU or SM contains. The nomenclature is a bit misleading, because CUDA cores are very different from GPU cores. More generally, NVIDIA very rarely provides much implementation detail, referring to abstract concepts like CUDA cores and threads for their description of the hardware, rather than conventional “physical” concepts like SIMD lane and SIMD lane instruction. This makes it difficult to analyze the hardware, requiring educated guesses to fill the gaps in the documentation.

One such gap in the documentation of NVIDIA is the partitioning of execution units within a processing block of a streaming multiprocessor. It may be that the equivalent SIMD lanes are all part of the same execution unit, but it could also be possible that they are divided over multiple execution units. The implication of such a division would be increased flexibility, because multiple instructions could be processed simultaneously. Addressing the situation specifically for the Volta SM, my educated guess would be that all the equivalent SIMD lanes belong to the same execution unit. This is based on two clues provided by the NVIDIA documentation:

Assuming that this guess is correct, the following Volta SM description would be (reasonably) accurate: each Volta SM contains four processing blocks, and each processing block contains (among other things) one FP64 execution unit with 8 SIMD lanes, one FP32 execution unit with 16 SIMD lanes, one INT32 execution unit with 16 SIMD lanes and two tensor cores.

This list obviously excludes a lot of items, but includes the parts most relevant to floating point mathematics.

Execution unit limitations

Even though modern GPUs are capable of running GPGPU applications, they are not as flexible as CPUs. A lot of this is due to the high level architecture of GPUs, but some limitations are a direct consequence of low level functionality. Case in point being the less flexible execution units. It should be noted that this generally holds true for most GPUs, but the details provided in this text once again focus on the NVIDIA Volta microarchitecture.

Some of the most obvious limitations are centered around FP16 and FP64 capabilities. Consumer GPUs from NVIDIA have significantly reduced FP16/FP64 performance compared to their GPGPU oriented counterparts. Reduced performance in those areas doesn’t hurt the intended workflow (gaming), but aids in market segmentation. The much reduced FP16/FP64 capabilities are only in place to provide compatibility.

NVIDIA uses two different kinds of FP32 execution units. The most common one is related to the CUDA core and exclusive to the consumer level GPUs, whereas the GPGPU products contain the other kind of FP32 execution unit. Both of these provide FP16 capabilities, but do so in very different ways. The “GPGPU FP32 units” use a technique called SWAR, SIMD within a register (this is speculation), to perform two FP16 operations per SIMD lane. The resulting FP16 performance is thus double that of FP32. Ever since the Pascal microarchitecture, “consumer FP32 units” have a single SIMD lane that behaves like the “GPGPU FP32 units”; all the other SIMD lanes lack the SWAR capability, resulting in the reduced performance mentioned above.
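
A brief sketch of how this packed FP16 path is exposed in CUDA, via the half precision intrinsics of the cuda_fp16.h header (supported on suitable hardware only): two half precision values share one 32 bit register and are processed by a single instruction.

    #include <cuda_fp16.h>

    // One packed instruction performs two FP16 fused multiply-adds:
    // (a.x * b.x + c.x, a.y * b.y + c.y)
    __device__ __half2 fma_two_halves(__half2 a, __half2 b, __half2 c) {
        return __hfma2(a, b, c);
    }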

As a relatively recent development (at the time of writing), FP32 execution units are no longer responsible for INT32 operations. Microarchitectures prior to NVIDIA Volta had FP32 execution units that provided INT32 support, thus lacking any dedicated INT32 execution units. The benefit of this approach is that the silicon real estate that INT32 execution units would occupy can be used for more FP32 SIMD lanes. The disadvantages are that each FP32 SIMD lane requires more silicon real estate to accommodate the INT32 capabilities, but more important is the fact that pointer arithmetic cannot be processed in parallel with the floating point arithmetic, which is an obvious disadvantage for GPGPU applications.

The FP64 capabilities are extremely poorly documented by NVIDIA. Consumer oriented GPUs have FP64 capabilities, but no implementation information has been released by NVIDIA since the Kepler microarchitecture. It is reasonable to assume that some form of hardware is responsible for FP64 calculations on consumer GPUs, but this has not been verified. The documentation is a lot better for the GPGPU oriented products, but still far from satisfactory. The biggest gap in the FP64 documentation is that it is indicated that the FP64 execution units cannot execute instructions in conjunction with other execution units, but no explanation or details are provided. My guess would be that the power consumption of the FP64 hardware is too high to allow for the power draw of other execution units within the power/thermal budget.

Memory

GPUs are massively parallel, high throughput devices. This means that they are (in principle) very well suited for the needs of scientific computing, where a small set of operations needs to be applied to a large set of data. These types of workloads have huge implications for the memory subsystem of a GPU, mainly because it needs to have a lot of high bandwidth memory in order to prevent data starvation of the execution units.

CPUs also benefit from very high bandwidth memory, but require far less of it to prevent data starvation, mostly because of the high ratio of operations per unit of data in the intended workload of CPUs. This made it feasible to have a small quantity of very fast memory, both in terms of bandwidth and latency, which is called cache. The second tier of the memory subsystem provides a lot of capacity at the cost of bandwidth and latency, resulting in effect in a hybrid memory subsystem.

Such a hybrid approach is far less valuable for GPUs, because a cache that would prevent data starvation to a satisfactory degree would be far too large to be feasible. The only remaining option is to create a memory subsystem that consists of a lot of very high bandwidth memory, sacrificing latency and to some extent capacity to remain acceptable in terms of cost. This does not mean that GPUs don’t have cache, because they do. However, it is configured much more like a small data queue. The exact structure of the cache will not be discussed, because its ramifications for algorithm design are much less stringent for GPGPU applications. Meaning that properly designed GPGPU algorithms should not hinge on low latency memory access provided by cache.

DRAM type and memory controller

Because the design criteria of GPU memory are very different from those of CPU memory, the type of DRAM used is also generally very different. Typical GPU DRAM is of the GDDR type, with common variants (at the time of writing) being GDDR5, GDDR5X and GDDR6. GDDR is loosely based on conventional DDR memory, which is employed for CPUs, with the main difference that the number of bits transferred per clock tick is a lot higher, at the expense of latency.

The second type of DRAM used on GPUs is called High Bandwidth Memory (HBM), with the most common variant at the moment being HBM2. It represents a sizable departure from conventional DDR and GDDR, which allows it to achieve much higher bandwidth and somewhat lower latency than GDDR. The downside of HBM is that it is a lot more expensive, causing it to only be employed in GPUs aimed at the top end of the market. As an example, the NVIDIA Tesla V100 is available with up to 32 GB of HBM2 memory, whereas the NVIDIA RTX 2080 Ti is only available with 11 GB of GDDR6 memory.

The memory controllers of GPUs are also different from those of CPUs, but those technical details have very little effect on algorithm design criteria. Suffice it to say that they generally employ many more memory channels than their CPU counterparts. Eight channels or more are not uncommon for consumer grade GPUs, even going as high as 12 in the NVIDIA RTX 2080 Ti. The NVIDIA Tesla V100 only uses 8 channels to communicate with the four HBM2 memory stacks it has on board, but still manages to generously outperform the RTX 2080 Ti: the RTX 2080 Ti delivers 616.0 GB/s, while the V100 reaches 897.0 GB/s.

Operating frequency

The operating frequency of GPUs is a lot lower than that of CPUs. The driving force behind the very high operating frequency of CPUs is single thread performance, which is obviously much less important for GPUs. It is therefore more efficient to use a lower clock frequency, causing a much reduced power consumption per GPU core and facilitating more cores within the same power budget. GPUs also support a turbo frequency, allowing for a short term increase in the power budget, very similar to CPU turbo.

Interfaces

GPUs have many interfaces, but most of them have to do with display functionality, making them irrelevant to GPGPU applications. The most important interfaces are those between the GPU memory pool and the CPU memory pool, which come (at the time of writing) in the form of either PCI-e or NVLink (proprietary to NVIDIA).

NVLink is an alternative to PCI-e, aimed at providing much more bandwidth. The second difference between NVLink and PCI-e is that NVLink is peer-to-peer, dropping the requirement of a central controller. This allows for much more convenient data transfers between multiple GPUs. The biggest disadvantage is its low adoption rate, because NVLink is only available for high cost NVIDIA GPUs and certain IBM POWER CPUs. The result is that it only has a sizable market penetration in the supercomputer space. Personal computers or servers equipped with NVLink are few and far between, making it much less reasonable to design algorithms that are contingent on the performance benefits that NVLink provides.