This shows you the differences between two versions of the page.
| en:multiasm:papc:chapter_6_16 [2025/11/22 11:42] – [Pause instruction] ktokarz | en:multiasm:papc:chapter_6_16 [2025/11/25 12:49] (current) – removed ktokarz | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ====== Optimisation (DRAFT) ====== | ||
| - | Optimisation depends strongly on the microarchitecture of the processor, and some recommendations change as new processor versions are released. Manufacturers usually publish the most up-to-date recommendations. The latest release of the Intel documentation is " | ||
| - | A selection of specific optimisation recommendations is described in this section. | ||
| - | ===== The use of inc and dec instructions ===== | ||
| - | It is natural for programmers to use the **inc** or **dec** instructions to increment or decrement a variable. They are simple and appear to be executed faster than addition and subtraction with a constant " | ||
| - | |||
| - | ===== Versions of logic instructions ===== | ||
| - | As new extensions are introduced, new instructions appear. In addition to advanced data processing instructions, | ||
| - | |||
| - | |||
| - | ===== Data placement ===== | ||
| - | It is recommended to place variables in memory at their natural boundaries. This means that if a data item is 16 bytes long, its address should be evenly divisible by 16. For 8-byte data, the address should be divisible by 8. | ||
| - | |||
| - | ===== Registers use ===== | ||
| - | It is recommended to use registers instead of memory for scalar data if possible. Keeping data in registers eliminates the need to load and store it in memory. | ||
| - | |||
| - | ===== Pause instruction ===== | ||
| - | A common way to wait briefly for an event is to spin in a loop. For a short waiting period, this method is more efficient than calling an operating system function that waits for the event. On modern processors, the **pause** instruction should be used inside such a loop. It tells the processor that the code is in a spin-wait loop, which lets it hand hardware resources to another logical processor for a while. | ||
| - | |||
| - | ===== Cache temporal locality ===== | ||
| - | The term temporal locality refers to the fact that data which has been used remains in a cache for a certain amount of time, until other data loaded into the cache replaces it. It is more efficient to keep data in a cache than to reload it. This property improves performance when the program uses the same variables repeatedly, e.g. in a loop. | ||
| - | When the data processed exceeds half the size of the level 1 cache, it is recommended to use the non-temporal data move instructions **movntq** and **movntdq** to store data from registers to memory. These instructions are hints to the processor to bypass the cache if possible. It doesn' | ||
| - | ===== Cache support instructions ===== | ||
| - | In modern microarchitectures, | ||
| - | |||
| - | |||
| - | There are also instructions that let the programmer help the processor use the cache efficiently. | ||
| - | * **movntq** stores the contents of an MMX register, bypassing the cache | ||
| - | * **movntps** stores the contents of an SSE register, bypassing the cache | ||
| - | * **maskmovq** stores selected bytes from an MMX register, bypassing the cache | ||
| - | * **movntdqa** non-temporal aligned load | ||
| - | |||
| - | Fence instructions guarantee that the load and/or store instructions before the fence are completed before the corresponding instructions after the fence. | ||
| - | * **sfence** orders store instructions: all stores before the fence become globally visible before any store after it | ||
| - | * **lfence** orders load instructions: all loads before the fence complete before any instruction after it starts | ||
| - | * **mfence** orders load and store instructions: all loads and stores before the fence become globally visible before any load or store after it | ||
| - | |||
| - | * **prefetch** a hint to the processor, | ||
| - | * **clflush** flushes a cache line from all levels of the cache hierarchy. | ||
| - | |||
| - | |||
| - | |||
| - | |||
| - | |||
| - | |||
| - | ===== Further reading ===== | ||
| - | The essential reading on optimisation is the vendors' | ||
| - | |||
| - | An exceptional resource on optimisation for x64 processors is the work of Agner Fog((https:// | ||
| - | |||
| - | Also of interest is the Understanding Windows x64 Assembly tutorial ((https:// | ||