As personal computers evolved, it became clear that they would be used not only professionally, for example in companies, financial institutions, and education, but also as the centres of home entertainment systems, enabling users to play games, watch videos, and listen to music. This led to extending processors with the ability to process multimedia data. Because stereo sound takes the form of a series of samples, and pictures are often represented by a matrix of pixels with three colour components each, the natural way to improve the performance of multimedia processing is to introduce parallelism. At the processor level, the answer is SIMD (Single Instruction, Multiple Data), which allows the execution unit to perform the same operation on many data elements at the same time. More formally, one stream of instructions performs operations on many data streams. The first SIMD instructions introduced in the x86 family that follow this idea are MMX (MultiMedia eXtension).
The MMX instruction set operates on 64-bit packed data types. Packed means that the 64-bit data can contain 8 bytes, 4 words, or 2 doublewords. New data types were defined on this basis. Packed data types are also called vectors. Please refer to the section “Integer vector data types” for details. The MMX instructions operate using eight 64-bit registers named MM0 - MM7.
To copy data from memory or between registers, two new data transfer instructions were introduced. The movd instruction allows copying 32 bits of data between MMX registers and memory or between MMX registers and general-purpose registers of the main processor. The movq instruction allows copying 64 bits of data between MMX registers and memory or between two MMX registers. In all MMX instructions except data transfer, the first operand, which is a destination operand, is always an MMX register.
The main idea of vector data processing is shown in figure 1. It shows the example of an operation performed with packed word vector data.
When performing arithmetic operations, the main processor stores additional information in flags in the EFLAGS register. The MMX unit does not have flags for each calculated result, so another approach is needed. The key information for arithmetic operations is the carry when adding and the borrow when subtracting. The simplest solution is to ignore the carry: if the maximum value is exceeded, the most significant bits are truncated and the result wraps around to a small value. For subtraction the situation is reversed, and the resulting value is larger than expected. For multimedia operations, a better solution is to limit the result to the maximum or minimum representable value. This approach is called saturation and comes in signed and unsigned versions. For example, once a pixel reaches its highest brightness, it is not brightened any further. Information about brightness differences is lost, but the resulting image looks natural. Let's consider the addition of four word-sized arguments in three versions. Figure 2 shows packed word addition with wraparound (paddw), figure 3 shows packed word addition with signed saturation (paddsw), and figure 4 shows packed word addition with unsigned saturation (paddusw).
The last letter of the instruction mnemonic specifies the size of the arguments and results (b for bytes, w for words, d for doublewords). MMX addition and subtraction instructions are shown in table 1.
| Mnemonic | operation | argument size | overflow management |
|---|---|---|---|
| paddb | addition | 8 bytes | wraparound |
| paddw | addition | 4 words | wraparound |
| paddd | addition | 2 doublewords | wraparound |
| paddsb | addition | 8 bytes | signed saturation |
| paddsw | addition | 4 words | signed saturation |
| paddusb | addition | 8 bytes | unsigned saturation |
| paddusw | addition | 4 words | unsigned saturation |
| psubb | subtraction | 8 bytes | wraparound |
| psubw | subtraction | 4 words | wraparound |
| psubd | subtraction | 2 doublewords | wraparound |
| psubsb | subtraction | 8 bytes | signed saturation |
| psubsw | subtraction | 4 words | signed saturation |
| psubusb | subtraction | 8 bytes | unsigned saturation |
| psubusw | subtraction | 4 words | unsigned saturation |
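The three overflow policies can be compared with a small scalar model. The sketch below is written in Python purely for illustration; the function names mirror the mnemonics, but the list representation (element 0 is the least significant word) and the helper `to_signed16` are assumptions of this example, not part of the instruction set.

```python
def to_signed16(x):
    """Interpret a 16-bit pattern as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def paddw(a, b):
    """Wraparound: keep only the low 16 bits of each sum."""
    return [(x + y) & 0xFFFF for x, y in zip(a, b)]

def paddsw(a, b):
    """Signed saturation: clamp each sum to -32768..32767."""
    out = []
    for x, y in zip(a, b):
        s = to_signed16(x) + to_signed16(y)
        s = max(-0x8000, min(0x7FFF, s))
        out.append(s & 0xFFFF)
    return out

def paddusw(a, b):
    """Unsigned saturation: clamp each sum to 0..65535."""
    return [min(0xFFFF, (x & 0xFFFF) + (y & 0xFFFF)) for x, y in zip(a, b)]

a = [0x7FFF, 0xFFFF, 0x8000, 0x0001]
b = [0x0001, 0x0001, 0xFFFF, 0x0002]
for name, op in (("paddw", paddw), ("paddsw", paddsw), ("paddusw", paddusw)):
    print(name, [f"{v:04X}" for v in op(a, b)])
```

The same bit patterns give three different results: 7FFFh + 0001h wraps to 8000h with paddw, stays at 7FFFh with paddsw, and is an ordinary in-range sum for paddusw.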
Multiplication requires twice as much space for the result as the size of the arguments. The solution is to split the operation into two multiplication instructions that store the higher and lower halves of the results. For vectors of words, the instructions are pmulhw (packed multiply with storing higher halves of the results) and pmullw (packed multiply with storing lower halves of the results), respectively. Later, the halves can be joined to form the full results with unpack instructions. To unpack the higher halves of the arguments, the punpckhwd (unpack from higher halves words to doublewords) instruction can be executed. For the lower halves, the punpcklwd instruction can be used. The whole algorithm is shown in figure 5.
The code that calculates the presented multiplication can look as follows:

    Numbers DW 01ACh, 2112h, 03F3h, 00A4h, 0006h, 0137h, 0AB7h, 00D8h

        LEA       ESI, Numbers
        MOVQ      mm0, [ESI]      ; mm0 = 00A4 03F3 2112 01AC
        MOVQ      mm1, [ESI+8]    ; mm1 = 00D8 0AB7 0137 0006
        MOVQ      mm2, mm0        ; keep a copy of the first vector
        PMULLW    mm0, mm1        ; mm0 = 8A60 50B5 2CDE 0A08 (lower halves)
        PMULHW    mm1, mm2        ; mm1 = 0000 002A 0028 0000 (higher halves)
        MOVQ      mm2, mm0        ; keep a copy of the lower halves
        PUNPCKLWD mm0, mm1        ; mm0 = 0028 2CDE 0000 0A08
        PUNPCKHWD mm2, mm1        ; mm2 = 0000 8A60 002A 50B5
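
The register values in the comments can be cross-checked with a scalar model. The following Python sketch is only an illustration; the functions imitate the instructions under the assumption that a vector is a list with element 0 as the least significant word, and `as_dwords` is a hypothetical helper introduced here.

```python
def to_signed16(x):
    """Interpret a 16-bit pattern as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def pmullw(a, b):
    """Low 16 bits of each signed word product."""
    return [(to_signed16(x) * to_signed16(y)) & 0xFFFF for x, y in zip(a, b)]

def pmulhw(a, b):
    """High 16 bits of each signed word product."""
    return [((to_signed16(x) * to_signed16(y)) >> 16) & 0xFFFF for x, y in zip(a, b)]

def punpcklwd(dst, src):
    """Interleave the two low words of dst and src, dst word first."""
    return [dst[0], src[0], dst[1], src[1]]

def punpckhwd(dst, src):
    """Interleave the two high words of dst and src, dst word first."""
    return [dst[2], src[2], dst[3], src[3]]

def as_dwords(words):
    """View four words (element 0 first) as two doublewords."""
    return [words[0] | (words[1] << 16), words[2] | (words[3] << 16)]

numbers = [0x01AC, 0x2112, 0x03F3, 0x00A4, 0x0006, 0x0137, 0x0AB7, 0x00D8]
mm0, mm1 = numbers[:4], numbers[4:]
low = pmullw(mm0, mm1)    # lower halves of the four products
high = pmulhw(mm1, mm0)   # higher halves of the four products
products = as_dwords(punpcklwd(low, high)) + as_dwords(punpckhwd(low, high))
print([f"{p:08X}" for p in products])
```

Note that pmulhw treats the words as signed; all values in this example are below 8000h, so the signed and unsigned products coincide.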
The MMX set of instructions also contains the pmaddwd (multiply and add packed words to doublewords) instruction. It computes the products of the corresponding signed word operands. The four intermediate doubleword products are summed in pairs to produce two doubleword results. Its behaviour is shown in figure 6. This instruction can simplify the multiplication process in the case of multiplying two pairs of word arguments, while the other two pairs give results of zero.
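As a rough illustration of the pairwise summation, the behaviour of pmaddwd can be modelled in a few lines of Python. This is a sketch under the assumption that vectors are lists with element 0 as the least significant word; it is not a definitive description of the hardware.

```python
def to_signed16(x):
    """Interpret a 16-bit pattern as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def pmaddwd(a, b):
    """Multiply signed words, then add neighbouring products into doublewords."""
    prods = [to_signed16(x) * to_signed16(y) for x, y in zip(a, b)]
    return [(prods[0] + prods[1]) & 0xFFFFFFFF,
            (prods[2] + prods[3]) & 0xFFFFFFFF]

# Two small dot products computed at once; 0xFFFF is the word -1.
print(pmaddwd([1, 2, 3, 0xFFFF], [10, 20, 30, 40]))

# With every second word set to zero, each result is a single full product.
full = pmaddwd([0x2112, 0, 0x03F3, 0], [0x0137, 0, 0x0AB7, 0])
print([f"{v:08X}" for v in full])
```

The second call shows the simplification mentioned above: when one product in each pair is zero, pmaddwd yields full 32-bit products without a separate pmullw, pmulhw, and unpack sequence.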
The set of comparison instructions allows for comparing values in two vectors. The result is stored as a mask of bits, with all ones at the element of the vector where the comparison result is true, and all zeros in the opposite case. There are six compare instructions as shown in table 2.
| Mnemonic | comparison type | argument size |
|---|---|---|
| pcmpeqb | equal | 8 bytes |
| pcmpeqw | equal | 4 words |
| pcmpeqd | equal | 2 doublewords |
| pcmpgtb | greater than | 8 bytes |
| pcmpgtw | greater than | 4 words |
| pcmpgtd | greater than | 2 doublewords |
An example of comparison instruction for equality of two vectors of words is shown in figure 7.
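The mask produced by a comparison can be sketched with a scalar model. The Python code below is an illustration only, assuming lists with element 0 as the least significant word; it also shows that the greater-than comparison treats its operands as signed values.

```python
def to_signed16(x):
    """Interpret a 16-bit pattern as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def pcmpeqw(a, b):
    """All ones where the words are equal, all zeros elsewhere."""
    return [0xFFFF if x == y else 0x0000 for x, y in zip(a, b)]

def pcmpgtw(a, b):
    """All ones where a > b as signed words, all zeros elsewhere."""
    return [0xFFFF if to_signed16(x) > to_signed16(y) else 0x0000
            for x, y in zip(a, b)]

a = [0x0001, 0x8000, 0x0005, 0x0007]
b = [0x0001, 0x0001, 0x0004, 0x0007]
print([f"{v:04X}" for v in pcmpeqw(a, b)])
print([f"{v:04X}" for v in pcmpgtw(a, b)])
```

The word 8000h is not greater than 0001h here, because as a signed value it is -32768.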
The unpack instructions presented in figure 5 are not the only ones. The unpack instructions for high-order data elements are punpckhbw, punpckhwd, and punpckhdq, and those for low-order data elements are punpcklbw, punpcklwd, and punpckldq. Figure 8 presents the unpacking of high-order bytes to words, and figure 9 the unpacking of low-order bytes to words.
The pack instructions are used to shrink the size of arguments and pack them into smaller data. Only three pack instructions are implemented in MMX extension: packsswb - pack words into bytes with signed saturation, packssdw - pack doublewords into words with signed saturation, and packuswb - pack words into bytes with unsigned saturation. The example of pack instruction is shown in figure 10.
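The effect of saturation during packing can be illustrated with a scalar model. This Python sketch assumes lists with element 0 as the least significant element, and that the destination words fill the lower half of the result while the source words fill the upper half.

```python
def to_signed16(x):
    """Interpret a 16-bit pattern as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def packsswb(dst, src):
    """Signed words to signed bytes, clamped to -128..127."""
    return [max(-128, min(127, to_signed16(w))) & 0xFF for w in dst + src]

def packuswb(dst, src):
    """Signed words to unsigned bytes, clamped to 0..255."""
    return [max(0, min(255, to_signed16(w))) for w in dst + src]

dst = [0x0001, 0x007F, 0x0080, 0xFFFF]
src = [0x0100, 0x8000, 0xFF80, 0x00FF]
print([f"{v:02X}" for v in packsswb(dst, src)])
print([f"{v:02X}" for v in packuswb(dst, src)])
```

In both variants the input words are read as signed values; only the range of the output bytes differs.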
Packed shift instructions perform shift operations on elements of the specified size. All elements of the vector are shifted separately. In a logical shift, the vacated bits are filled with zeros; in an arithmetic shift right, the most significant bit is copied to preserve the sign of the values. There are eight shift instructions, as presented in table 3.
| Mnemonic | operation | argument size | type of shift |
|---|---|---|---|
| psllw | shift left | 4 words | logical |
| pslld | shift left | 2 doublewords | logical |
| psllq | shift left | 1 quadword | logical |
| psrlw | shift right | 4 words | logical |
| psrld | shift right | 2 doublewords | logical |
| psrlq | shift right | 1 quadword | logical |
| psraw | shift right | 4 words | arithmetic |
| psrad | shift right | 2 doublewords | arithmetic |
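The difference between logical and arithmetic shifts can be seen in a scalar model. The Python sketch below is an illustration under the assumption that vectors are lists of 16-bit words; it also models shift counts larger than 15, which clear the element in logical shifts and fill it with the sign bit in arithmetic shifts.

```python
def to_signed16(x):
    """Interpret a 16-bit pattern as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def psllw(a, n):
    """Logical shift left of each word; zeros enter from the right."""
    return [0 if n > 15 else (x << n) & 0xFFFF for x in a]

def psrlw(a, n):
    """Logical shift right of each word; zeros enter from the left."""
    return [0 if n > 15 else (x & 0xFFFF) >> n for x in a]

def psraw(a, n):
    """Arithmetic shift right of each word; the sign bit is replicated."""
    return [(to_signed16(x) >> min(n, 15)) & 0xFFFF for x in a]

a = [0x8000, 0x00F0, 0xFFFE, 0x0001]
for name, op in (("psllw", psllw), ("psrlw", psrlw), ("psraw", psraw)):
    print(name, [f"{v:04X}" for v in op(a, 4)])
```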
MMX logical instructions operate on the 64-bit data as a whole. They perform bitwise operations as shown in table 4.
| Mnemonic | operation |
|---|---|
| pand | AND |
| pandn | AND NOT |
| por | OR |
| pxor | XOR |
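A common use of the logical instructions is to combine them with a comparison mask to select elements from two vectors without branching. The Python sketch below models the 64-bit registers as integers; `select` is a hypothetical helper of this example. It assumes the usual operand roles of the AND NOT instruction: the destination operand is inverted and then ANDed with the source operand.

```python
M64 = (1 << 64) - 1  # all ones in a 64-bit register

def pand(dst, src):
    return dst & src

def pandn(dst, src):
    """AND NOT: the destination is inverted, then ANDed with the source."""
    return (~dst & M64) & src

def por(dst, src):
    return dst | src

def pxor(dst, src):
    return dst ^ src

def select(mask, a, b):
    """Take elements of a where the mask is all ones, elements of b elsewhere."""
    return por(pand(mask, a), pandn(mask, b))

mask = 0xFFFF_0000_0000_FFFF   # e.g. produced by a packed word comparison
a = 0x1111_2222_3333_4444
b = 0xAAAA_BBBB_CCCC_DDDD
print(f"{select(mask, a, b):016X}")
```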
MMX instructions use the same physical registers as the FPU. As a result, FPU and MMX instructions cannot be freely mixed in the same fragment of code. Switching from MMX code back to FPU code requires executing the emms instruction, which marks all FPU registers as empty. Fortunately, newer extensions (SSE, AVX) introduce a separate set of registers for improved flexibility.
SSE is a large group of instructions that extend SIMD processing to floating-point calculations and increase the size and number of the registers. The abbreviation SSE comes from the name Streaming SIMD Extensions. As the number of instructions introduced in all SSE versions exceeds a few hundred, in this section we present a general overview of each SSE version and detailed information on some chosen interesting instructions. The first group of SSE instructions defines a new vector data type containing four single-precision floating-point numbers. Four 32-bit values require 128-bit registers. These new registers are named XMM0 - XMM7 and are separate from any previously implemented registers, so SSE floating-point operations do not conflict with MMX and FPU.
In modern processors, it is very important to transfer data to and from memory effectively. The memory subsystem can transfer data much faster if the data is aligned to a specific address boundary. For aligned SSE transfers, the address must be evenly divisible by 16. In the SSE extension, two versions of the packed data transfer instruction were implemented. The movups instruction copies packed single-precision data from any address, while movaps requires an address aligned to 16 bytes (an unaligned address causes a general-protection exception). The movss instruction moves a scalar single-precision value, which doesn't have to be aligned. It is also possible to copy data between the upper half of the XMM register and memory with the movhps instruction, between the lower half of the XMM register and memory with movlps, and from the lower to the higher half or from the higher to the lower half of XMM registers with movlhps and movhlps, respectively. The movmskps instruction copies the most significant bits of the four single-precision floating-point values to a general-purpose register. It allows us to make a bit mask based on the sign bits of the elements of the vector.
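The sign-bit mask can be illustrated with a scalar model. The Python sketch below assumes that a vector is a list of four floats with element 0 first; `float_bits` is a hypothetical helper that exposes the single-precision bit pattern.

```python
import struct

def float_bits(x):
    """Return the 32-bit pattern of x stored as a single-precision float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def movmskps(v):
    """Collect the sign bit of each element into bits 0..3 of the result."""
    mask = 0
    for i, x in enumerate(v):
        mask |= (float_bits(x) >> 31) << i
    return mask

print(bin(movmskps([1.0, -2.0, 0.0, -0.0])))
```

Note that the mask reflects the stored sign bit, so negative zero sets its bit even though it compares equal to zero.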
The SSE implements the vector and scalar calculations on single-precision floating-point numbers. No prefix for instruction names operating on floating-point numbers was added, but the mnemonic suffix describes the type. PS (packed single) - action on vectors, SS (scalar single) - operation on scalars. If the instructions operate on halves of XMM registers (i.e. either refer to bits 0..63 or 64..127), the instruction mnemonics contain the letter L or H. The idea of vector and scalar operations is shown in figure 11 and figure 12, respectively.
In the SSE extension, mathematical calculations on single-precision floating-point numbers are implemented in both vector (packed) and scalar versions. These instructions are summarised in table 5.
| Mnemonic | operation | argument type |
|---|---|---|
| addps | addition | vector |
| addss | addition | scalar |
| subps | subtraction | vector |
| subss | subtraction | scalar |
| mulps | multiplication | vector |
| mulss | multiplication | scalar |
| divps | division | vector |
| divss | division | scalar |
| rcpps | reciprocal | vector |
| rcpss | reciprocal | scalar |
| sqrtps | square root | vector |
| sqrtss | square root | scalar |
| rsqrtps | reciprocal of square root | vector |
| rsqrtss | reciprocal of square root | scalar |
| maxps | maximum (bigger) value | vector |
| maxss | maximum (bigger) value | scalar |
| minps | minimum (smaller) value | vector |
| minss | minimum (smaller) value | scalar |
Besides math calculations, there are instructions for comparing vector (cmpps) and scalar (cmpss) values. As a result, we obtain all-ones or all-zeros fields, as in MMX. The condition of the comparison is encoded in the third, 8-bit immediate argument. Assemblers usually implement a set of pseudoinstructions which automatically choose the constant value. The scalar versions of these pseudoinstructions are presented in table 6.
| Pseudoinstruction | operation | instruction |
|---|---|---|
| cmpeqss xmm1, xmm2 | equal | cmpss xmm1, xmm2, 0 |
| cmpltss xmm1, xmm2 | less than | cmpss xmm1, xmm2, 1 |
| cmpless xmm1, xmm2 | less or equal | cmpss xmm1, xmm2, 2 |
| cmpunordss xmm1, xmm2 | unordered | cmpss xmm1, xmm2, 3 |
| cmpneqss xmm1, xmm2 | not equal | cmpss xmm1, xmm2, 4 |
| cmpnltss xmm1, xmm2 | not less than | cmpss xmm1, xmm2, 5 |
| cmpnless xmm1, xmm2 | not less or equal | cmpss xmm1, xmm2, 6 |
| cmpordss xmm1, xmm2 | ordered | cmpss xmm1, xmm2, 7 |
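The predicates differ mainly in how they treat NaN operands. The following Python sketch models the scalar comparison on the low element only; it is an illustration, assuming that the negated predicates (4 to 6) are true for unordered operands while the plain ones (0 to 2) are false.

```python
import math

def cmpss(a, b, imm8):
    """Return the 32-bit mask written to the low element of the destination."""
    unordered = math.isnan(a) or math.isnan(b)
    results = {
        0: not unordered and a == b,     # equal
        1: not unordered and a < b,      # less than
        2: not unordered and a <= b,     # less or equal
        3: unordered,                    # unordered
        4: unordered or a != b,          # not equal
        5: unordered or not (a < b),     # not less than
        6: unordered or not (a <= b),    # not less or equal
        7: not unordered,                # ordered
    }
    return 0xFFFFFFFF if results[imm8 & 7] else 0x00000000

nan = float("nan")
print(f"{cmpss(1.0, 2.0, 1):08X}")   # 1.0 < 2.0
print(f"{cmpss(nan, 1.0, 1):08X}")   # NaN is never less than anything
print(f"{cmpss(nan, 1.0, 5):08X}")   # so "not less than" is true
```

This is why cmpnltss is not the same as a "greater or equal" test: the two differ when either operand is NaN.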
With the use of the comiss instruction, it is possible to compare scalars and set the flags in the EFLAGS register directly according to the result of the comparison.
There are four logical instructions which operate on all 128 bits of the XMM register. These are andps, andnps, orps, xorps. It is rather clear that the functions are logical and, logical and not, logical or and logical xor, respectively.
Two unpack instructions, unpcklps and unpckhps, operate similarly to unpack instructions known already from MMX. Because the source and destination data are packed single-precision floating-point values, unlike in MMX, these instructions are not used to form longer data types, but change the positions of two elements from two vectors. It is presented in figure 13.
A more universal instruction is shufps. It selects two of the four single-precision values from the destination (first) operand and writes them to the lower half of the result, then selects two of the four values from the source (second) operand and writes them to the upper half. The result replaces the contents of the destination register. Which values are taken is determined by the third, 8-bit immediate argument: each two-bit field of the immediate gives the index of one packed single-precision value. For 11 it is X3 or Y3, for 10 it is X2 or Y2, for 01 it is X1 or Y1, and for 00 it is X0 or Y0. It is presented in figure 14.
Together with the new data registers, an additional control register appeared in the processor. It is named MXCSR and plays a role similar to the FPU control and status registers. Two new instructions handle it: stmxcsr saves MXCSR to memory, and ldmxcsr restores it from memory. The instruction fxsave stores the state of both the x87 unit and the SSE extension, and fxrstor restores the state of both. Some additional MMX instructions were introduced together with the SSE extension. In the SSE, the first set of data conversion instructions was implemented. The summary of all these instructions will be presented in the following chapter. Also, cache supporting instructions were added. These instructions will be described in the chapter on optimisation.
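The role of the immediate in shufps can be sketched with a scalar model. The Python code below is an illustration, assuming the operand roles given in Intel's instruction reference: the two lower elements of the result are selected from the first (destination) operand and the two upper elements from the second (source) operand. Vectors are lists with element 0 first.

```python
def shufps(dst, src, imm8):
    """Select result elements using the four two-bit fields of imm8."""
    return [
        dst[imm8 & 3],          # bits 1:0 choose result element 0 from dst
        dst[(imm8 >> 2) & 3],   # bits 3:2 choose result element 1 from dst
        src[(imm8 >> 4) & 3],   # bits 5:4 choose result element 2 from src
        src[(imm8 >> 6) & 3],   # bits 7:6 choose result element 3 from src
    ]

x = [0.0, 1.0, 2.0, 3.0]
y = [10.0, 11.0, 12.0, 13.0]
print(shufps(x, y, 0b11_01_10_00))   # mixes elements of x and y
print(shufps(x, x, 0b00_01_10_11))   # same register twice: reverse the order
print(shufps(x, x, 0b00_00_00_00))   # same register twice: broadcast element 0
```

Using the same register for both operands, as in the last two calls, turns shufps into a general permutation of one vector.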
The SSE2 set of instructions implements integer vector operations using the XMM registers. In general, the same instruction mnemonics defined for MMX can be used with XMM registers, offering vectors that are twice as long. Additionally, the floating-point calculations are complemented with vector operations on the double-precision data type. In XMM registers, vectors of two double-precision values can be processed. SSE2 uses the same software environment (eight 128-bit XMM registers) as SSE. In the SSE2 extension, the denormals-are-zeros mode was introduced. In this mode, the processor automatically converts all denormalised floating-point arguments to zero. In such a case, a flag indicating the denormalised argument is not set, and an exception is not raised. The denormals-are-zeros mode is not compliant with IEEE Standard 754, but it allows implementation of faster algorithms for advanced audio and video processing. The arithmetic instructions are similar to those of SSE, but they have the suffix pd (packed double) or sd (scalar double) instead of ps and ss, respectively.
In figure 15 we present the type conversion instructions. They enable conversion between integer and floating-point data of various sizes and in different registers. The green arrows and instruction nodes represent the conversion from single-precision floating-point to integers, pink represents the conversion from double-precision floating-point to integers, blue represents the conversion from integers to floating points, and orange represents the conversion between single and double precision floating-point.
The SSE3 is a set of 13 instructions. The main innovation in SSE3 is the implementation of horizontal instructions. These instructions perform calculations on the elements of a vector within the same register. There are four such instructions. The haddpd performs horizontal addition of double-precision values, the hsubpd is a horizontal subtraction of double-precision values, the haddps is a horizontal addition of single-precision values, and the hsubps is a horizontal subtraction of single-precision values. All horizontal instructions operate in a similar manner. The lower (bottom) part of the resulting vector is the result of operation on the bottom and top elements of the first (destination) operand; the higher (top) part of the resulting vector is the result of operation on the second (source) operand's bottom and top. The best way to present the principles of horizontal operations is a picture. Because in the subtraction operation the order of arguments is important, the hsubpd instruction is shown in figure 16.
When the source vectors have more than two elements, as in the hsubps instruction, it is also important to know the order of the elements in the resulting vector. It is shown in figure 17.
SSSE3 stands for Supplemental Streaming SIMD Extensions 3. It is a set of 16 instructions introduced in the Core 2 architecture. It implements integer horizontal operations on XMM registers. The principles are the same as in the horizontal instructions of SSE3, but the instructions process vectors of doublewords or words. They are summarised in table 7.
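The element order of the single-precision horizontal instructions can be written down as a scalar model. The Python sketch below is an illustration, assuming vectors are lists with element 0 first; the two lower results come from the first (destination) operand and the two upper results from the second (source) operand.

```python
def haddps(dst, src):
    """Add neighbouring pairs within each operand."""
    return [dst[0] + dst[1], dst[2] + dst[3],
            src[0] + src[1], src[2] + src[3]]

def hsubps(dst, src):
    """Subtract the odd-numbered element from the even-numbered one in each pair."""
    return [dst[0] - dst[1], dst[2] - dst[3],
            src[0] - src[1], src[2] - src[3]]

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
print(haddps(a, b))
print(hsubps(a, b))
```

Applying haddps twice with the same register as both operands sums all four elements of a vector.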
| Instruction | operation | data |
|---|---|---|
| phaddd | addition | unsigned doublewords |
| phaddw | addition | unsigned words |
| phaddsw | saturated addition | signed words |
| phsubd | subtraction | unsigned doublewords |
| phsubw | subtraction | unsigned words |
| phsubsw | saturated subtraction | signed words |
Two data shuffle instructions are worth mentioning. The pshufb instruction makes copies of bytes from the first 128-bit operand based on the control information taken from the second 128-bit operand. Each byte in the control operand determines the resulting byte in the respective position.
The illustration is shown in figure 18.
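The byte selection performed by pshufb can be illustrated with a scalar model. The Python sketch below assumes 16-element lists of bytes with element 0 first. It also models one detail of the instruction: a control byte with its most significant bit set produces a zero byte, otherwise the low four bits of the control byte select the byte to copy.

```python
def pshufb(data, ctrl):
    """Build each result byte from the data byte chosen by the control byte."""
    return [0 if c & 0x80 else data[c & 0x0F] for c in ctrl]

data = list(b"ABCDEFGHIJKLMNOP")
reverse = list(range(15, -1, -1))            # control vector 15, 14, ..., 0
print(bytes(pshufb(data, reverse)))          # bytes in reversed order
print(bytes(pshufb(data, [0] * 16)))         # byte 0 broadcast to all positions
print(bytes(pshufb(data, [0x80] * 16)))      # all bytes cleared
```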
The palignr instruction combines bytes from two source operands as shown in figure 19. The position of the byte split is specified by the third, immediate operand. In the figure, the immediate is equal to 2.
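One way to picture palignr is as a byte-wise right shift of the two operands joined together. The Python sketch below is an illustration, assuming 16-element byte lists with element 0 first and the first (destination) operand forming the upper half of the joined value.

```python
def palignr(dst, src, imm8):
    """Join dst:src, shift right by imm8 bytes, keep the low 16 bytes."""
    joined = src + dst + [0] * 16    # src is the low half, zeros shift in on top
    n = min(imm8, 32)                # shifting by 32 or more leaves only zeros
    return joined[n:n + 16]

src = list(range(0, 16))
dst = list(range(16, 32))
print(palignr(dst, src, 2))
```

With an immediate of 2, the result starts at byte 2 of the second operand and ends with the two lowest bytes of the first operand.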
The SSE4 is composed of SSE4.1 and SSE4.2. These groups include instructions supplementing previous extensions. For example, there are eight instructions which expand support for packed integer minimum and maximum determination, or twelve instructions which improve packed integer format conversions with sign extension and zero extension. The dpps and dppd instructions calculate the dot product of four single-precision and two double-precision operands, respectively. Additionally, the arguments are controlled with the third immediate operand. The example showing the dppd is presented in figure 20.
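The role of the immediate in the dot product instructions can be sketched for dpps. The Python model below is an illustration only: it assumes that the upper four bits of the immediate choose which products enter the sum and the lower four bits choose which result elements receive the sum (the others are set to zero). It uses Python floats and therefore does not reproduce single-precision rounding.

```python
def dpps(dst, src, imm8):
    """Conditional dot product of two four-element vectors."""
    total = sum((dst[i] * src[i] for i in range(4) if imm8 & (0x10 << i)), 0.0)
    return [total if imm8 & (1 << i) else 0.0 for i in range(4)]

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
print(dpps(a, b, 0xF1))   # full dot product stored in element 0 only
print(dpps(a, b, 0x7F))   # three-element dot product broadcast to all elements
```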
There are also advanced shuffle, insert and extract instructions which make it possible to manipulate positions of the data of various types. A few examples will be shown in the following figures. The insertps inserts a scalar single-precision floating-point value with the position of the vector's element in source and destination controlled with an 8-bit immediate. The example showing the insertps instruction is presented in figure 21. In this example, the immediate contains the bit value of 10011000b.
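The three fields of the insertps immediate can be decoded in a scalar model. The Python sketch below is an illustration for the register-to-register form, assuming that bits 7:6 select the source element, bits 5:4 select the destination position, and bits 3:0 form a mask of result elements to be cleared. Vectors are lists with element 0 first.

```python
def insertps(dst, src, imm8):
    """Insert one element of src into dst, then clear the masked elements."""
    out = list(dst)
    out[(imm8 >> 4) & 3] = src[(imm8 >> 6) & 3]
    return [0.0 if imm8 & (1 << i) else x for i, x in enumerate(out)]

dst = [1.0, 2.0, 3.0, 4.0]
src = [10.0, 20.0, 30.0, 40.0]
print(insertps(dst, src, 0b10011000))
```

For the immediate 10011000b, element 2 of the source is copied to position 1 of the destination, and element 3 of the result is cleared.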
In SSE4.2, a set of string compare instructions was added. As the XMM registers can contain sixteen bytes, it is much more efficient to implement string processing algorithms with the bigger XMM registers than with the registers of the main processor and its string instructions. There are four string compare instructions (see table 8), but each of them can be configured to achieve different functionalities. The length of the strings can be explicit or implicit. Explicit length means that the length of the first operand is specified in the EAX/RAX register, and the length of the second operand in the EDX/RDX register. Implicit length means that both operands contain null-terminated strings. The instructions can produce two kinds of results. Index means that the index of the first or last matching element is returned in the ECX register. Mask means that a mask is returned in the XMM0 register, either as a bit mask (one bit for each compared element) or as a mask of the size of the elements (similarly to MMX compare).
| Instruction | length | type of the result |
|---|---|---|
| pcmpestri | explicit | index |
| pcmpestrm | explicit | mask |
| pcmpistri | implicit | index |
| pcmpistrm | implicit | mask |
The third, immediate operand encodes the comparison method and result encoding.
| bits 1:0 | data type |
|---|---|
| 00 | unsigned BYTE |
| 01 | unsigned WORD |
| 10 | signed BYTE |
| 11 | signed WORD |
| bits 3:2 | operation | comment |
|---|---|---|
| 00 | Equal Any | find any of the specified characters in the input string |
| 01 | Ranges | check if characters are within the specified ranges |
| 10 | Equal Each | check if the input strings are equal |
| 11 | Equal Ordered | check if the needle string is in the haystack string |
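The two fields described in the tables above can be extracted from the immediate with simple bit operations. The Python sketch below is only an illustration of the decoding; the dictionaries and the function `decode_low_bits` are hypothetical names introduced for this example, and the remaining bits of the immediate are not modelled.

```python
DATA_TYPES = {
    0b00: "unsigned BYTE",
    0b01: "unsigned WORD",
    0b10: "signed BYTE",
    0b11: "signed WORD",
}

OPERATIONS = {
    0b00: "Equal Any",
    0b01: "Ranges",
    0b10: "Equal Each",
    0b11: "Equal Ordered",
}

def decode_low_bits(imm8):
    """Return (data type, operation) encoded in bits 1:0 and 3:2 of imm8."""
    return DATA_TYPES[imm8 & 0b11], OPERATIONS[(imm8 >> 2) & 0b11]

print(decode_low_bits(0x0C))   # substring search on unsigned bytes
print(decode_low_bits(0x05))   # range check on unsigned words
```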
The SSE4.2 string compare instructions are an advanced, powerful means of processing byte or word strings. A detailed explanation of the behaviour of the SSE4.2 string instructions, together with illustrations, can be found in [1].
AVX stands for Advanced Vector Extensions. AVX implements larger 256-bit YMM registers as extensions of the XMM registers. In 64-bit mode, the number of YMM registers is increased to 16. Many SSE instructions are extended to handle the new, bigger data types; their AVX forms keep the same mnemonics with an added v prefix (for example, vaddps). The most important improvement in the instruction set of x64 processors is the implementation of RISC-like instructions in which the destination operand can differ from the two source operands. This three-operand SIMD instruction format is encoded with the VEX coding scheme. The AVX2 extension implements more SIMD instructions for operation with 256-bit registers. AVX-512 extends the register size to 512 bits. An interesting, comprehensive description of a variety of x64 AVX instructions is available on the website [2].