en:multiasm:paarm:chapter_5_12 [2025/12/08 10:33] (current) – eriks.klavins

Processors today can execute many instructions in parallel using pipelining and multiple functional units. These techniques allow the processor to reorder instructions internally to avoid pipeline stalls (out-of-order execution), to guess which path a branch will take (branch prediction), and more. Without speculation, the processor would have to stall at every conditional branch until its outcome is known, losing much of this parallelism.

===== Speculative instruction execution =====

Let's start with an explanation of how speculation works. The pipeline breaks the whole instruction down into small micro-operations. The first micro-operation (first step) is to fetch the instruction from memory. The second step is to decode the instruction;
''<
''<
''<
The possible outcomes are shown in the picture below.
{{:

In the example above, the comparison is made on the ''<

From the architectural point of view, speculated instructions are invisible, as if they never ran. But from a microarchitectural perspective (cache contents, predictors, buffers), all speculation leaves traces. Regarding registers, speculative updates remain in internal buffers until commit; no architectural changes happen before then. Regarding memory, speculative stores are not visible to other cores or devices and remain buffered until committed. Speculative loads, however, can occur: they may bring data into the cache even if the result is later discarded.

In this example, the ARM processor will perform speculative memory access:\\
''<
''<
If the registers ''<

===== Barriers (instruction synchronization / data memory / data synchronization / one-way barrier) =====

Many processors today can execute instructions out of the programmer-defined order. This is done to improve performance, but it also means that memory accesses do not always happen in the order written in the program.

Barrier instructions enforce ordering between operations. Whether the processor has a single core or multiple cores, these instructions ensure that data is stored before the next operation uses it, that the result of a previous instruction is visible before the next instruction executes, and that a second core (if present) accesses the newest data. ARM provides special barrier instructions for this: ''DMB'' (Data Memory Barrier), ''DSB'' (Data Synchronization Barrier) and ''ISB'' (Instruction Synchronization Barrier).

Since instructions are prefetched and decoded ahead of time, instructions fetched earlier might not yet reflect the newest state. The ''ISB'' instruction flushes the pipeline, so every instruction after the barrier is fetched again once the barrier completes:\\
''<
''<
''<
''<

The code example ensures that the control register settings take effect before the following instructions are fetched. The setting enables the Memory Management Unit and installs new address translation rules. The following instructions, fetched after the barrier, are then subject to the new translation rules.
Sometimes instructions access memory back to back: a previous instruction stores data in memory, and the next instruction is intended to use the most recently stored data. (This scenario is deliberately simplified to explain the data synchronisation barrier; it is unlikely to occur in this exact form in real life.) Imagine an instruction that computes a result and must store it in memory, while the following instruction must use the result stored by the previous one.

The first problem may occur because the processor uses special write buffers to maximise data throughput to memory. Such a buffer collects multiple chunks of data and writes them to memory in a single cycle. The buffer is not visible to the following instructions, so they may read old data rather than the newest. The data synchronisation barrier solves this problem by ensuring that the data is actually stored in memory rather than left hanging in the buffer.\\
**Core 0:** \\
''<
''<
''<

**Core 1:**\\
''<
''<
''<
In the example, the data memory barrier ensures that the data is stored before the status flag is set. The ''DMB'' and ''DSB'' barriers are also used to make data visible through the cache hierarchy. They take an argument with two roles: it determines which classes of memory accesses are ordered by the barrier, and it names the shareability domain over which the instruction operates, i.e. the scope of cache coherency among the units that can access memory. This scope effectively defines which observers the ordering imposed by the barrier extends to.

In this example, the barrier takes a parameter: the domain ''ISH'' (Inner Shareable) restricts the barrier to the inner shareable domain. The processors in this domain share caches coherently, but hardware units in other areas of the system (such as DMA devices, GPUs, etc.) are outside it. The “''<
Each domain has its own options, listed in the table below.
^ Option ^ Ordered accesses (before-after) ^ Shareability domain ^
| OSH | Any-Any | Outer shareable |
| OSHLD | Load-Load, Load-Store | Outer shareable |
| OSHST | Store-Store | Outer shareable |
| NSH | Any-Any | Non-shareable |
| NSHLD | Load-Load, Load-Store | Non-shareable |
| NSHST | Store-Store | Non-shareable |
| ISH | Any-Any | Inner shareable |
| ISHLD | Load-Load, Load-Store | Inner shareable |
| ISHST | Store-Store | Inner shareable |
| SY | Any-Any | Full system |
| ST | Store-Store | Full system |
| LD | Load-Load, Load-Store | Full system |

The "Ordered accesses" column specifies which classes of accesses the barrier orders. "Any-Any" orders all loads and stores against all loads and stores; "Store-Store" orders stores only against later stores; "Load-Load, Load-Store" orders loads against later loads and later stores.

All these barriers are also relevant to high-level programming languages, where unsafe optimisation and specific memory ordering may occur. In most scenarios, there is no need to pay special attention to memory barriers on single-processor systems: although the CPU supports out-of-order and speculative execution, it ensures that the final execution result matches the programmer's intent. Memory barriers become necessary when you:
  * Share data between multiple CPU cores. Under a weakly ordered memory model, the CPU may reorder memory accesses, so different cores can observe updates in different orders.
  * Perform operations on peripherals, such as accessing memory-mapped device registers.
  * Modify the memory management state, for example during context switching, page-fault handling, and page-table updates.
In short, the purpose of memory barrier instructions is to ensure that the CPU performs the program's memory accesses in the intended order.

===== Conditional instructions =====

Meanwhile, speculative instruction execution consumes power. If the speculation gave the correct result, the power wasn't wasted; otherwise, it was. Power consumption must be taken into account when designing program code. And not only power consumption is essential, but so is data safety. The Cortex-A76 on the Raspberry Pi 5 has a very advanced branch predictor, but mispredictions still cause wasted instructions and wasted energy.

The use of the conditional select instruction ''CSEL'' removes the branch entirely:\\
''<
''<

This conditional instruction writes the value of the first source register to the destination register if the condition is TRUE. If the condition is FALSE, it writes the value of the second source register to the destination register. So, if the ''<
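As a rough C analogue (illustrative function names, not from the text): the ternary below expresses exactly this select semantics, and at ''-O2'' AArch64 compilers typically turn it into a compare plus a single ''CSEL'' rather than a branch. A portable bit-mask version shows how the same selection can be written branchlessly by hand.

```c
/* Select semantics of CSEL: the destination gets 'a' when the
 * condition is true, 'b' otherwise. */
long select_ternary(int cond, long a, long b)
{
    return cond ? a : b;              /* usually compiled to cmp + csel */
}

/* Hand-written branchless equivalent: build an all-ones or all-zeros
 * mask from the condition and merge the two sources with it. */
long select_mask(int cond, long a, long b)
{
    long mask = -(long)(cond != 0);   /* cond ? ~0L : 0L */
    return (a & mask) | (b & ~mask);
}
```

Both versions compute the same result; the difference is that neither involves a branch the predictor could get wrong.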
| + | |||
| + | Other conditional instructions can be used similarly: | ||
| + | |||
| + | {{: | ||
| + | |||
| + | These conditional instructions are helpful in branchless conditional checks. Taking into account that these instructions can also be executed speculatively, | ||
| + | |||