This shows you the differences between two versions of the page.
| en:multiasm:papc:chapter_6_16 [2025/11/20 09:25] – [Further research] ktokarz | en:multiasm:papc:chapter_6_16 [2025/11/25 12:49] (current) – removed ktokarz | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ====== Optimisation (DRAFT) ====== | ||
| - | Optimisation strongly depends on the microarchitecture of the processor, and some recommendations change as new processor generations appear. Vendors usually publish the most up-to-date recommendations. The latest release of the Intel documentation is " | ||
| - | A selection of specific optimisation recommendations is described in this section. | ||
| - | ===== The use of inc and dec instructions ===== | ||
| - | It is natural for programmers to use the **inc** or **dec** instructions to increment or decrement a variable. They are simple and appear to be executed faster than addition and subtraction with a constant " | ||
| - | |||
| - | ===== Versions of logic instructions ===== | ||
| - | As new extensions are introduced, new instructions appear. In addition to advanced data processing instructions, | ||
| - | |||
| - | |||
| - | ===== Data placement ===== | ||
| - | It is recommended to place variables in memory at their natural boundaries. This means that the address of a data item should be evenly divisible by its size: for 16-byte data, the address should be divisible by 16; for 8-byte data, by 8. | ||
| - | |||
| - | ===== Register use ===== | ||
| - | It is recommended to use registers instead of memory for scalar data if possible. Keeping data in registers eliminates the need to load and store it in memory. | ||
| - | |||
| - | ===== Cache temporal locality ===== | ||
| - | The term temporal locality refers to the tendency of a program to reuse data it has accessed recently. Once data is used, it remains in the cache until other data loaded into the cache evicts it, and keeping data in the cache is more efficient than reloading it from memory. This improves performance when the program uses the same variables repeatedly, e.g. in a loop. | ||
| - | When the processed data exceeds half the size of the level 1 cache, it is recommended to use the non-temporal data move instructions **movntq** and **movntdq** to store data from registers to memory. These instructions are hints to the processor to bypass the cache if possible. It doesn' | ||
| - | ===== Cache support instructions ===== | ||
| - | In modern microarchitectures, | ||
| - | |||
| - | |||
| - | There are also instructions that let the programmer help the processor use the cache efficiently. | ||
| - | |||
| - | movntq - Writes the contents of an MMX register to memory, bypassing the cache. | ||
| - | movntps - Writes packed single-precision values from an SSE (XMM) register to memory, bypassing the cache. | ||
| - | maskmovq - Writes selected bytes from an MMX register to memory, bypassing the cache. | ||
| - | sfence - Guarantees that all memory stores issued before the sfence instruction are globally visible before any stores after the sfence instruction. | ||
| - | prefetch - a hint to the processor, | ||
| - | clflush - Flushes a cache line from all levels of the cache. | ||
| - | lfence - Guarantees that all memory loads issued before the lfence instruction are completed before any loads after the lfence instruction. | ||
| - | mfence - Guarantees that all memory reads and writes issued before the mfence instruction are completed before any reads or writes after the mfence instruction. | ||
| - | pause - Hints to the processor that the code is in a spin-wait loop; it introduces a short, implementation-dependent delay. | ||
| - | |||
| - | movntdqa - Loads 16-byte aligned data from memory into an XMM register using a non-temporal hint. | ||
| - | |||
| - | |||
| - | |||
| - | ===== Further research ===== | ||
| - | The essential reading on optimisation is the vendors' | ||
| - | |||
| - | An interesting resource on optimisation for x64 processors is by Agner Fog((https:// | ||
| - | |||
| - | Assembly tutorial ((https:// | ||