====== Modern Processors: Pipeline, Superscalar, Branch Prediction, Hyperthreading ======
Modern processors have a very complex design and include many units responsible mainly for shortening the execution time of the software.

===== Cache =====

Cache memory forms a layer in the memory hierarchy that intermediates between the main memory and the processor registers. The main reason for the introduction of cache memory is that main memory, based on DRAM technology, is much slower than the processor, whose internal storage is built with fast static (SRAM) technology. The cache exploits two software features: spatial locality and temporal locality. Spatial locality results from the fact that the processor executes code, which in most cases is a sequence of instructions arranged directly one after another. Temporal locality arises because programs often run in loops, repeatedly working on a single set of data over short intervals. In both cases, a larger fragment of a program or data can be loaded into the cache and operated on without having to access the main memory each time. Main memory is designed to read and write data in blocks significantly faster than when accessing random addresses. These properties allow a code fragment to be read in its entirety from main memory into the cache and executed without the need to access RAM for each instruction separately. In the case of data, the processor performs calculations after reading a block into the cache and then stores the results in a single write sequence.
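A short C experiment (a sketch added here for illustration, not part of the original text) makes spatial locality visible: summing a matrix row by row uses every byte of each fetched cache line, while summing it column by column wastes most of each line. The 64-byte line size mentioned in the comments is an assumed typical value.

<code c>
#include <stdio.h>
#include <time.h>

#define N 2048

/* A large matrix stored in row-major order, as C arrays are. */
static double a[N][N];

int main(void)
{
    double sum = 0.0;
    clock_t t0;

    /* Row-major traversal: consecutive accesses touch consecutive
       addresses, so each (typically 64-byte) line fetched into the
       cache is fully used before the next one is needed. */
    t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    printf("row-major:    %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    /* Column-major traversal: consecutive accesses are N*8 bytes
       apart, so almost every access fetches a new line and uses only
       8 of its bytes before the line is evicted. */
    t0 = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    printf("column-major: %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    printf("checksum: %f\n", sum);  /* keeps the loops from being optimised away */
    return 0;
}
</code>

On typical hardware the second loop is several times slower, even though both perform exactly the same additions.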
| + | |||
| + | In modern processors, the cache is divided into several levels, usually three. The first-level cache (L1) is the closest to the processor, the fastest, and is usually divided into separate instruction and data caches. The second-level cache (L2) is shared, slower and usually larger than the L1 cache. The largest and the slowest is the third-level cache (L3). It is closest to the computer' | ||
| + | |||
Besides the size, important parameters of a cache are the line length and the associativity.
The length of the line is usually expressed in bytes. It tells how many bytes are stored in a single, smallest possible data fragment. It also determines at which addresses such a data fragment starts in main memory. For example, if the cache line length is 64 bytes and memory is byte-organised, every cached block starts at an address that is a multiple of 64, so the six least significant address bits select a byte within the line.
Associativity tells how many cache lines can be used to store the block from a specific address. If the block can go to any cache line, the cache is fully associative. If there is only one possible location, the cache is named direct-mapped. A fully associative cache is more flexible, but complex and expensive. A direct-mapped cache is simple, but it can cause data conflicts if two blocks of memory which should go to the same cache line need to be loaded. In real processors, a compromise solution is often implemented: the set-associative cache, in which each block can be placed in any of the N lines of one set (an N-way set-associative cache).
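To make this mapping concrete, the sketch below splits a byte address into a tag, a set index, and a line offset. The geometry (a 32 KiB cache with 64-byte lines and 8 ways, giving 64 sets) is an assumed example, not a value given in this chapter.

<code c>
#include <stdint.h>
#include <stdio.h>

/* Assumed example geometry: 32 KiB cache, 64-byte lines, 8-way
   set-associative -> 32768 / 64 / 8 = 64 sets. */
#define LINE_SIZE 64u
#define NUM_SETS  64u

int main(void)
{
    uint64_t addr = 0x7ffd12345678u;  /* an arbitrary byte address */

    uint64_t offset = addr % LINE_SIZE;               /* byte within the line */
    uint64_t set    = (addr / LINE_SIZE) % NUM_SETS;  /* set it maps to       */
    uint64_t tag    = addr / LINE_SIZE / NUM_SETS;    /* identifies the block */

    printf("address 0x%llx -> tag 0x%llx, set %llu, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)set, (unsigned long long)offset);
    return 0;
}
</code>

A block may be stored in any of the 8 lines of its set; two blocks conflict only when they share the same set index and the set is already full.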
===== Pipeline =====
As was described in the previous chapter, executing a single instruction requires many actions which must be performed by the processor. We could see that each step, or even substep, can be performed by a separate logical unit. This feature has been used by designers of modern processors to create a processor in which instructions are executed in a pipeline. A pipeline is a collection of logical units that execute many instructions at the same time, each of them at a different stage of execution. If the instructions arrive in a continuous stream, the pipeline allows the program to execute faster than on a processor that does not support the pipeline. Note that the pipeline does not reduce the time of execution of a single instruction. It increases the throughput of the instruction stream.
A simple pipeline is implemented in AVR microcontrollers. It has two stages, which means that while one instruction is executed, another one is fetched, as shown in Fig {{ref>pipelineavr}}.
<figure pipelineavr>
{{ : }}
<caption>The two-stage pipeline of AVR microcontrollers: while one instruction is executed, the next one is fetched.</caption>
</figure>
Modern processors implement longer pipelines. For example, the Pentium III used a 10-stage pipeline, the Pentium 4 a 20-stage one, and the Pentium 4 Prescott even a 31-stage pipeline. Does a longer pipeline mean faster program execution? Everything has benefits and drawbacks. The undoubted benefit of a longer pipeline is more instructions executed at the same time, which gives a higher instruction throughput. But a problem appears when branch instructions come. While the instruction stream flows through the pipeline, a conditional branch can invalidate the work already done: if the branch goes the other way than assumed, all instructions fetched after it must be discarded and the pipeline refilled from the new address. The longer the pipeline, the more cycles such a flush wastes.
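A back-of-the-envelope model illustrates the trade-off. Assuming the idealised case of one instruction completed per cycle once the pipeline is full, and that every mispredicted branch flushes the whole pipeline, the cost of a flush grows with the number of stages. All the numbers below are assumed purely for illustration.

<code c>
#include <stdio.h>

/* Idealised model: filling the pipeline takes (stages - 1) cycles,
   then one instruction finishes per cycle; every mispredicted branch
   flushes the pipeline, wasting another (stages - 1) cycles. */
static double total_cycles(double instructions, int stages, double flushes)
{
    return (stages - 1) + instructions + flushes * (stages - 1);
}

int main(void)
{
    double n   = 1e6;  /* instructions in the stream (assumed)     */
    double bad = 1e4;  /* mispredicted branches, 1% of n (assumed) */
    int depths[] = { 2, 10, 20, 31 };

    for (int i = 0; i < 4; i++)
        printf("%2d-stage pipeline: %.0f cycles\n",
               depths[i], total_cycles(n, depths[i], bad));
    return 0;
}
</code>

With the same 1% misprediction rate, the 31-stage pipeline loses thirty times more cycles per flush than the 2-stage one, which is why accurate branch prediction becomes critical as pipelines get deeper.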
===== Superscalar =====
The superscalar processor increases the speed of program execution because it can execute more than a single instruction during a clock cycle. It is realised by simultaneously dispatching instructions to different execution units on the processor. The superscalar processor doesn't duplicate the whole pipeline; instead, it multiplies the execution units, as shown in Fig {{ref>superscalar}}.
<figure superscalar>
{{ : }}
<caption>A superscalar pipeline dispatching instructions to multiple execution units.</caption>
</figure>
In the x86 family, the first processor with two paths of execution was the Pentium, with two execution units called the U and V pipelines. Modern x64 processors like the i7 implement six execution units. Not all execution units have the same functionality; for example, some of them handle integer operations while others execute floating-point instructions, as presented in Table {{ref>executionunits}}.
<table executionunits>
<caption>Execution units of a modern x64 processor.</caption>
</table>
===== Branch prediction =====
As was mentioned, the pipeline can suffer invalidation if a conditional branch is not properly predicted. The branch prediction unit is used to guess the outcome of conditional branch instructions. It helps to reduce delays in program execution by predicting which way the branch will go and fetching the following instructions from the predicted path before the condition is actually evaluated.
There are many methods of predicting branches. In general, the processor implements a buffer with the addresses of the last few branch instructions, together with a history register for every branch. Based on this history, the branch prediction unit can guess whether the branch should be taken.
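A classic example of such a history-based scheme is the 2-bit saturating counter, which must be wrong twice in a row before its prediction flips, so a loop branch is mispredicted only once per loop exit. Below is a minimal sketch of this scheme; the table size and the way the branch address selects a counter are assumptions made for illustration.

<code c>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* 2-bit saturating counters: states 0,1 predict "not taken",
   states 2,3 predict "taken". */
#define SLOTS 1024
static uint8_t counter[SLOTS];  /* one small history register per slot */

static bool predict(uint64_t pc)
{
    return counter[pc % SLOTS] >= 2;
}

static void update(uint64_t pc, bool taken)
{
    uint8_t *c = &counter[pc % SLOTS];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int main(void)
{
    /* A loop branch that is taken 9 times, then not taken once,
       over and over (a typical inner loop). */
    uint64_t pc = 0x400a10;  /* hypothetical branch address */
    int hits = 0, total = 0;

    for (int rep = 0; rep < 100; rep++)
        for (int i = 0; i < 10; i++) {
            bool taken = (i != 9);
            hits += (predict(pc) == taken);
            total++;
            update(pc, taken);
        }

    printf("prediction accuracy: %d/%d\n", hits, total);
    return 0;
}
</code>

After a short warm-up, the counter mispredicts only the final iteration of each loop pass, giving roughly 90% accuracy on this pattern; real processors combine several far more elaborate predictors.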
===== Hyperthreading =====
Hyper-Threading Technology is Intel's approach to simultaneous multithreading, which allows the operating system to execute more than one thread on a single physical core.
For each physical core, two logical processor cores are visible to the operating system, which shares the load between them when possible. Hyperthreading uses the superscalar architecture to increase the number of instructions that operate in parallel in the pipeline on separate data. With Hyper-Threading, the execution units of one physical core are shared between two threads, so when one thread stalls or leaves some units idle, the other thread can use them.
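On Linux and other POSIX systems the effect can be observed directly: the operating system reports the number of logical processors, which on a machine with Hyper-Threading enabled is typically twice the number of physical cores. A minimal sketch using the standard POSIX call:

<code c>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Logical processors visible to the operating system; with
       Hyper-Threading enabled, each hardware thread counts. */
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors online: %ld\n", logical);
    return 0;
}
</code>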
<note info>
The real path of instruction processing is much more complex. Additional techniques are implemented to achieve better performance, e.g. out-of-order execution and register renaming. They are performed automatically by the processor, and the assembler programmer does not influence their behaviour.
</note>