MMX, SSE and AVX extensions

At some point in the evolution of personal computers, it became clear that they would be used not only professionally, for example in companies, financial institutions, and education, but also as the centres of home entertainment systems, enabling users to play games, watch videos, and listen to music. This led to equipping processors with the ability to process multimedia data. Because stereo sound has the form of a series of samples, and pictures are often represented by a matrix of pixels with three colour components each, the natural way to improve the performance of multimedia processing is to introduce parallelism. At the processor level, the answer is SIMD (Single Instruction, Multiple Data), which allows the execution unit to perform the same operation on many data units at the same time. More formally, one stream of instructions performs operations on many data streams. The first SIMD instructions introduced in the x86 family that follow this idea are MMX (MultiMedia eXtension).

MMX

The MMX instruction set operates on 64-bit packed data types. Packed means that the 64-bit data can contain 8 bytes, 4 words, or 2 doublewords. New data types were defined on this basis. Packed data types are also called vectors. Please refer to the section “Integer vector data types” for details. The MMX instructions use eight 64-bit registers named MM0 - MM7.

Data transfer

To copy data from memory or between registers, two new data transfer instructions were introduced. The movd instruction allows copying 32 bits of data between MMX registers and memory or between MMX registers and general-purpose registers of the main processor. The movq instruction allows copying 64 bits of data between MMX registers and memory or between two MMX registers. In all MMX instructions except data transfer, the first operand, which is a destination operand, is always an MMX register.

In modern 64-bit processors, the movq instruction is extended to copy 64-bit data between MMX registers and general-purpose registers of the main processor.

Basic vector calculations

The main idea of vector data processing is shown in figure 1. It shows the example of an operation performed with packed word vector data.

Figure 1: The idea of vector data processing

When performing arithmetic operations, the main processor stores additional information in the flags of the EFLAGS register. The MMX unit does not have flags for each calculated result, so another approach is needed. The key information for arithmetic operations is the carry when adding and the borrow when subtracting. The simplest solution is to ignore the carry: if the maximum value is exceeded, the most significant bits are truncated and the result wraps around to a small value. In the case of subtraction, the situation is reversed, and the resulting value is larger than expected. For multimedia operations, a better solution is to limit the result to the maximum or minimum representable value. This approach is called saturation and comes in signed and unsigned versions. For example, when a pixel reaches its highest brightness, it is not brightened any further. Information about brightness differences is lost this way, but the resulting image looks natural. Let's consider the addition of two vectors of four words in three versions. Figure 2 shows packed word addition with wraparound (paddw), figure 3 shows packed word addition with signed saturation (paddsw), and figure 4 shows packed word addition with unsigned saturation (paddusw).

Figure 2: The illustration of packed word addition with wraparound

Figure 3: The illustration of packed word addition with signed saturation

Figure 4: The illustration of packed word addition with unsigned saturation

The last letter of the mnemonic specifies the size of the arguments and results (b for bytes, w for words, d for doublewords). MMX addition and subtraction instructions are shown in table 1.

Table 1: MMX addition and subtraction instructions

Mnemonic	operation	argument size	overflow management
paddb	addition	8 bytes	wraparound
paddw	addition	4 words	wraparound
paddd	addition	2 doublewords	wraparound
paddsb	addition	8 bytes	signed saturation
paddsw	addition	4 words	signed saturation
paddusb	addition	8 bytes	unsigned saturation
paddusw	addition	4 words	unsigned saturation
psubb	subtraction	8 bytes	wraparound
psubw	subtraction	4 words	wraparound
psubd	subtraction	2 doublewords	wraparound
psubsb	subtraction	8 bytes	signed saturation
psubsw	subtraction	4 words	signed saturation
psubusb	subtraction	8 bytes	unsigned saturation
psubusw	subtraction	4 words	unsigned saturation
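
The three overflow strategies can be compared with a small scalar model. The sketch below is written in Python purely for illustration; the function names mirror the mnemonics, but each function only models the per-element result and is not the instruction itself. Vectors are lists of four 16-bit patterns, and the sample values are assumptions chosen to trigger overflow.

```python
def to_signed16(x):
    """Interpret a 16-bit pattern as a two's-complement value."""
    return x - 0x10000 if x & 0x8000 else x

def paddw(a, b):
    """Wraparound: the carry out of each element is discarded."""
    return [(x + y) & 0xFFFF for x, y in zip(a, b)]

def paddusw(a, b):
    """Unsigned saturation: each sum is clamped to 0..0xFFFF."""
    return [min(x + y, 0xFFFF) for x, y in zip(a, b)]

def paddsw(a, b):
    """Signed saturation: each sum is clamped to -32768..32767."""
    result = []
    for x, y in zip(a, b):
        s = to_signed16(x) + to_signed16(y)
        s = max(-0x8000, min(0x7FFF, s))
        result.append(s & 0xFFFF)
    return result

a = [0x7FFF, 0xFFFF, 0x8000, 0x0001]
b = [0x0001, 0x0001, 0xFFFF, 0x0002]
wrapped = paddw(a, b)      # [0x8000, 0x0000, 0x7FFF, 0x0003]
unsigned = paddusw(a, b)   # [0x8000, 0xFFFF, 0xFFFF, 0x0003]
signed = paddsw(a, b)      # [0x7FFF, 0x0000, 0x8000, 0x0003]
```

Only the last element, which does not overflow, gives the same result in all three versions.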

A multiplication result requires twice as much space as the arguments. The solution is to split the operation into two multiplication instructions, which store the higher and lower halves of the results. For vectors of words, these are pmulhw (packed multiply, storing the higher halves of the signed results) and pmullw (packed multiply, storing the lower halves of the results), respectively. Later, the halves can be joined into full results with unpack instructions. The punpckhwd instruction (unpack words from the higher halves to doublewords) interleaves the higher halves of its arguments, and punpcklwd does the same for the lower halves. The whole algorithm is shown in figure 5.

Figure 5: The illustration of packed word multiplication and unpacking results to doublewords

The code that calculates the presented multiplication can look as follows:

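; Data: two vectors of four words each
Numbers     DW  01ACh, 2112h, 03F3h, 00A4h,
                0006h, 0137h, 0AB7h, 00D8h
; Code
LEA         ESI, Numbers
MOVQ        mm0, [ESI]      ; mm0 = 00A4 03F3 2112 01AC
MOVQ        mm1, [ESI+8]    ; mm1 = 00D8 0AB7 0137 0006
MOVQ        mm2, mm0        ; keep a copy of the first vector
PMULLW      mm0, mm1        ; mm0 = 8A60 50B5 2CDE 0A08 (lower halves)
PMULHW      mm1, mm2        ; mm1 = 0000 002A 0028 0000 (higher halves)
MOVQ        mm2, mm0        ; keep a copy of the lower halves
PUNPCKLWD   mm0, mm1        ; mm0 = 0028 2CDE 0000 0A08 (products 1 and 0)
PUNPCKHWD   mm2, mm1        ; mm2 = 0000 8A60 002A 50B5 (products 3 and 2)
EMMS                        ; release the MMX state before any FPU code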
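
The register values in the comments can be cross-checked with a scalar model. The following Python sketch is an illustration only: the function names are borrowed from the mnemonics, each vector is a list with element 0 first (the reverse of the order in the comments), and the unpack helpers return doublewords directly rather than a 64-bit register image.

```python
def s16(x):
    """Interpret a 16-bit pattern as a signed value."""
    return x - 0x10000 if x & 0x8000 else x

def pmullw(a, b):
    """Lower 16 bits of each signed word product."""
    return [(s16(x) * s16(y)) & 0xFFFF for x, y in zip(a, b)]

def pmulhw(a, b):
    """Upper 16 bits of each signed word product."""
    return [((s16(x) * s16(y)) >> 16) & 0xFFFF for x, y in zip(a, b)]

def punpcklwd(dst, src):
    """Interleave words 0 and 1: dst gives the low word, src the high word."""
    return [dst[i] | (src[i] << 16) for i in (0, 1)]

def punpckhwd(dst, src):
    """Interleave words 2 and 3 in the same way."""
    return [dst[i] | (src[i] << 16) for i in (2, 3)]

mm0 = [0x01AC, 0x2112, 0x03F3, 0x00A4]
mm1 = [0x0006, 0x0137, 0x0AB7, 0x00D8]
low = pmullw(mm0, mm1)     # [0x0A08, 0x2CDE, 0x50B5, 0x8A60]
high = pmulhw(mm0, mm1)    # [0x0000, 0x0028, 0x002A, 0x0000]
products = punpcklwd(low, high) + punpckhwd(low, high)
# products == [0x00000A08, 0x00282CDE, 0x002A50B5, 0x00008A60]
```

Each element of products equals the ordinary integer product of the corresponding input words, which confirms that the two halves were joined correctly.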

Advanced calculations

The MMX instruction set also contains the pmaddwd instruction (multiply and add packed words to doublewords). It computes the products of the corresponding signed word operands. The four intermediate doubleword products are summed in pairs to produce two doubleword results. Its behaviour is shown in figure 6. If one word of each pair in an operand is zero, the instruction returns the two full 32-bit products of the remaining words, which simplifies multiplication when only two pairs of words need to be multiplied.

Figure 6: The illustration of packed word multiplication and sum to doublewords
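
A minimal Python model of this multiply-and-add behaviour is sketched below. It is an illustration under the assumption that vectors are lists with element 0 first; the function name mirrors the mnemonic but is only a model, and the input values are arbitrary.

```python
def s16(x):
    """Interpret a 16-bit pattern as a signed value."""
    return x - 0x10000 if x & 0x8000 else x

def pmaddwd(a, b):
    """Multiply four signed word pairs and add neighbouring products."""
    p = [s16(x) * s16(y) for x, y in zip(a, b)]
    return [(p[0] + p[1]) & 0xFFFFFFFF, (p[2] + p[3]) & 0xFFFFFFFF]

a = [0x0003, 0xFFFE, 1000, 0]   # 0xFFFE is -2 as a signed word
b = [0x0004, 0x0005, 300, 7]
result = pmaddwd(a, b)          # [2, 300000]
```

The first result is 3*4 + (-2)*5 = 2. In the second pair one word is zero, so the result is simply the full 32-bit product 1000*300.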

Comparison

The set of comparison instructions allows for comparing values in two vectors. The result is stored as a mask of bits, with all ones at the element of the vector where the comparison result is true, and all zeros in the opposite case. There are six compare instructions as shown in table 2.

Table 2: MMX comparison instructions

Mnemonic	comparison type	argument size
pcmpeqb	equal	8 bytes
pcmpeqw	equal	4 words
pcmpeqd	equal	2 doublewords
pcmpgtb	greater than	8 bytes
pcmpgtw	greater than	4 words
pcmpgtd	greater than	2 doublewords

An example of a comparison instruction testing two vectors of words for equality is shown in figure 7.

Figure 7: Vector data comparison
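
The mask results can be modelled in a few lines. The Python sketch below is illustrative only; it assumes lists of four 16-bit patterns with element 0 first, and it treats the greater-than comparison as signed, which is how the pcmpgt instructions are defined.

```python
def s16(x):
    """Interpret a 16-bit pattern as a signed value."""
    return x - 0x10000 if x & 0x8000 else x

def pcmpeqw(a, b):
    """All ones where the words are equal, all zeros elsewhere."""
    return [0xFFFF if x == y else 0x0000 for x, y in zip(a, b)]

def pcmpgtw(a, b):
    """All ones where a > b when both are read as signed words."""
    return [0xFFFF if s16(x) > s16(y) else 0x0000 for x, y in zip(a, b)]

a = [0x0005, 0xFFFF, 0x1235, 0x7FFF]
b = [0x0005, 0x0001, 0x1234, 0x8000]
equal_mask = pcmpeqw(a, b)     # [0xFFFF, 0x0000, 0x0000, 0x0000]
greater_mask = pcmpgtw(a, b)   # [0x0000, 0x0000, 0xFFFF, 0xFFFF]
```

Note the second element: 0xFFFF is -1 as a signed word, so it is not greater than 1, although it would be in an unsigned comparison.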

Data conversion

The unpack instructions presented in figure 5 are not the only ones. The unpack instructions for high-order data elements are punpckhbw, punpckhwd, and punpckhdq, and those for low-order data elements are punpcklbw, punpcklwd, and punpckldq. Figure 8 presents unpacking of high-order bytes to words, and figure 9 of low-order bytes to words.

Figure 8: The illustration of unpacking high-order bytes to words

Figure 9: The illustration of unpacking low-order bytes to words
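
The byte-to-word unpacking can be sketched as a Python model. This is an illustration, not the instruction itself: vectors are lists of eight bytes with element 0 first, and the result is returned as four words. Unpacking against a vector of zeros is a common way to zero-extend bytes to words; the sample pixel values are made up.

```python
def punpcklbw(dst, src):
    """Interleave the four low-order bytes: dst gives the low byte of each word."""
    return [dst[i] | (src[i] << 8) for i in range(4)]

def punpckhbw(dst, src):
    """Interleave the four high-order bytes in the same way."""
    return [dst[i] | (src[i] << 8) for i in range(4, 8)]

pixels = [0x10, 0x80, 0xFF, 0x7F, 0x01, 0x02, 0x03, 0x04]
zero = [0] * 8
low_words = punpcklbw(pixels, zero)    # [0x0010, 0x0080, 0x00FF, 0x007F]
high_words = punpckhbw(pixels, zero)   # [0x0001, 0x0002, 0x0003, 0x0004]
```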

The pack instructions shrink the size of the arguments and pack them into smaller data elements. Only three pack instructions are implemented in the MMX extension: packsswb (pack words into bytes with signed saturation), packssdw (pack doublewords into words with signed saturation), and packuswb (pack words into bytes with unsigned saturation). An example of a pack instruction is shown in figure 10.

Figure 10: The illustration of packing doublewords to words
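
Packing with saturation can be modelled as follows. The Python sketch is an assumption-laden illustration: vectors are lists with element 0 first, the destination elements end up in the lower half of the result and the source elements in the upper half, and the input values are chosen to show clamping.

```python
def s16(x):
    return x - 0x10000 if x & 0x8000 else x

def s32(x):
    return x - 0x100000000 if x & 0x80000000 else x

def packssdw(dst, src):
    """Signed doublewords to signed words, clamped to -32768..32767."""
    return [max(-0x8000, min(0x7FFF, s32(v))) & 0xFFFF for v in dst + src]

def packuswb(dst, src):
    """Signed words to unsigned bytes, clamped to 0..255."""
    return [max(0, min(0xFF, s16(v))) for v in dst + src]

words = packssdw([0x00012345, 0xFFFFFFFE], [0x00000064, 0x80000000])
# words == [0x7FFF, 0xFFFE, 0x0064, 0x8000]
pixels = packuswb([0x0100, 0xFFFF, 0x007F, 0x00FF], [0, 0, 0, 0])
# pixels == [0xFF, 0x00, 0x7F, 0xFF, 0, 0, 0, 0]
```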

Shift

Packed shift instructions perform shift operations on elements of the specified size. All elements of the vector are shifted separately by the same number of bits. In a logical shift, the vacated bits are filled with zeros; in an arithmetic shift right, the most significant bit is copied to preserve the sign of the values. There are eight shift instructions, as presented in table 3.

Table 3: MMX shift instructions

Mnemonic	operation	argument size	type of shift
psllw	shift left	4 words	logical
pslld	shift left	2 doublewords	logical
psllq	shift left	1 quadword	logical
psrlw	shift right	4 words	logical
psrld	shift right	2 doublewords	logical
psrlq	shift right	1 quadword	logical
psraw	shift right	4 words	arithmetic
psrad	shift right	2 doublewords	arithmetic
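
The difference between logical and arithmetic shifts is easiest to see on a value with the top bit set. The Python sketch below is a model only, with vectors as lists of four 16-bit patterns; the handling of counts above 15 follows the behaviour described in Intel's manuals (zeros for logical shifts, sign fill for arithmetic shifts).

```python
def s16(x):
    return x - 0x10000 if x & 0x8000 else x

def psllw(a, n):
    """Logical shift left of each word; bits shifted out are lost."""
    return [(x << n) & 0xFFFF if n < 16 else 0 for x in a]

def psrlw(a, n):
    """Logical shift right: vacated bits are filled with zeros."""
    return [x >> n if n < 16 else 0 for x in a]

def psraw(a, n):
    """Arithmetic shift right: vacated bits copy the sign bit."""
    n = min(n, 15)
    return [(s16(x) >> n) & 0xFFFF for x in a]

a = [0x8000, 0x00F0, 0xFFFF, 0x0001]
left = psllw(a, 4)      # [0x0000, 0x0F00, 0xFFF0, 0x0010]
logical = psrlw(a, 4)   # [0x0800, 0x000F, 0x0FFF, 0x0000]
arith = psraw(a, 4)     # [0xF800, 0x000F, 0xFFFF, 0x0000]
```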

Logical

MMX logical instructions operate on the 64-bit data as a whole. They perform bitwise operations as shown in table 4.

Table 4: MMX logical instructions

Mnemonic	operation
pand	AND
pandn	AND NOT
por	OR
pxor	XOR
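
A typical use of these instructions is to combine them with a comparison mask in order to select elements from two vectors without branching. The Python sketch below models registers as 64-bit integers; the helper select is hypothetical and only illustrates the idiom. It assumes the AND NOT instruction (pandn in Intel's mnemonics) inverts its destination operand before the AND.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def pand(dst, src):
    return dst & src

def pandn(dst, src):
    """AND NOT: the destination is inverted, then ANDed with the source."""
    return (~dst & MASK64) & src

def por(dst, src):
    return dst | src

def select(mask, a, b):
    """Take bits of a where mask is 1 and bits of b where mask is 0."""
    return por(pand(mask, a), pandn(mask, b))

mask = 0xFFFF00000000FFFF   # e.g. the result of a packed word comparison
a = 0x1111222233334444
b = 0xAAAABBBBCCCCDDDD
chosen = select(mask, a, b)   # 0x1111BBBBCCCC4444
```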

Co-existence of FPU and MMX

MMX instructions use the same physical registers as the FPU: the MM0 - MM7 registers are mapped onto the x87 data registers. As a result, FPU and MMX instructions cannot be freely mixed in the same fragment of code. Before switching from MMX to FPU code, the emms instruction must be executed; it marks all FPU registers as empty so that subsequent FPU instructions work correctly. Fortunately, newer extensions (SSE, AVX) introduce a separate set of registers for improved flexibility.

SSE

SSE is a large group of instructions that extends SIMD processing to floating-point calculations and increases the size and number of the registers. The abbreviation SSE stands for Streaming SIMD Extensions. As the number of instructions introduced in all SSE versions runs to a few hundred, this section presents a general overview of each SSE version and detailed information on selected instructions. The first group of SSE instructions defines a new vector data type containing four single-precision floating-point numbers. Four 32-bit values require 128-bit registers. These new registers are named XMM0 - XMM7 and are separate from any previously implemented registers, so SSE floating-point operations do not conflict with MMX and the FPU.

Data transfer

In modern processors, it is very important to transfer data to and from memory effectively. The memory subsystem can transfer data much faster if the data is aligned. For SSE instructions, an aligned address is one that is evenly divisible by 16. The SSE extension implements two versions of the packed data transfer instruction: movups copies packed single-precision data to or from any address, while movaps requires an aligned address and raises an exception otherwise. The movss instruction moves a scalar single-precision value, which doesn't have to be aligned. It is also possible to copy data between the upper half of an XMM register and memory with the movhps instruction, between the lower half of an XMM register and memory with movlps, and from the lower to the higher half or from the higher to the lower half of XMM registers with movlhps and movhlps, respectively. The movmskps instruction copies the most significant bits of the four single-precision floating-point values to a general-purpose register. It allows us to build a bit mask from the sign bits of the vector's elements.
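
The sign-bit mask produced by movmskps can be modelled as below. The Python sketch is an illustration only: a vector is a list of four floats with element 0 first, and the standard struct module is used to obtain the single-precision bit pattern of each value.

```python
import struct

def movmskps(v):
    """Collect the sign bit of each of four single-precision values."""
    mask = 0
    for i, x in enumerate(v):
        bits = struct.unpack('<I', struct.pack('<f', x))[0]
        mask |= (bits >> 31) << i
    return mask

mask = movmskps([1.0, -2.0, 0.0, -0.0])   # 0b1010
```

Bit 1 is set for -2.0 and bit 3 for -0.0, because negative zero also has its sign bit set.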

Calculations

SSE implements vector and scalar calculations on single-precision floating-point numbers. The mnemonics of instructions operating on floating-point numbers have no prefix; instead, a suffix describes the type: PS (packed single) for operations on vectors and SS (scalar single) for operations on scalars. If an instruction operates on half of an XMM register (i.e. refers to bits 0..63 or 64..127), its mnemonic contains the letter L or H. The idea of vector and scalar operations is shown in figure 11 and figure 12, respectively.

Figure 11: The idea of vector data processing in SSE

Figure 12: The idea of scalar data processing in SSE

In the SSE extension, mathematical calculations on single-precision floating-point numbers are implemented in both vector (packed) and scalar versions. These instructions are summarised in table 5.

Table 5: SSE math calculations instructions

Mnemonic	operation	argument type
addps	addition	vector
addss	addition	scalar
subps	subtraction	vector
subss	subtraction	scalar
mulps	multiplication	vector
mulss	multiplication	scalar
divps	division	vector
divss	division	scalar
rcpps	reciprocal	vector
rcpss	reciprocal	scalar
sqrtps	square root	vector
sqrtss	square root	scalar
rsqrtps	reciprocal of square root	vector
rsqrtss	reciprocal of square root	scalar
maxps	maximum (bigger) value	vector
maxss	maximum (bigger) value	scalar
minps	minimum (smaller) value	vector
minss	minimum (smaller) value	scalar
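
The difference between the packed and scalar variants can be sketched with addition. In the Python model below, a register is a list of four floats with element 0 first, and the helper f32 (an assumption of this sketch) rounds a result to single precision. The scalar version changes only element 0 and leaves the other three destination elements untouched.

```python
import struct

def f32(x):
    """Round a Python float to single precision."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def addps(dst, src):
    """Packed: all four elements are added."""
    return [f32(x + y) for x, y in zip(dst, src)]

def addss(dst, src):
    """Scalar: only element 0 is added; the rest of dst is kept."""
    return [f32(dst[0] + src[0])] + dst[1:]

d = [1.0, 2.0, 3.0, 4.0]
s = [10.0, 20.0, 30.0, 40.0]
packed = addps(d, s)   # [11.0, 22.0, 33.0, 44.0]
scalar = addss(d, s)   # [11.0, 2.0, 3.0, 4.0]
```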

Comparison

Besides mathematical calculations, there are instructions for comparing vector (cmpps) and scalar (cmpss) values. As in MMX, the result is a field of all ones or all zeros for each compared element. The comparison condition is encoded in the third operand, an 8-bit immediate. Assemblers usually implement a set of pseudoinstructions that automatically choose the constant value. The scalar versions of these pseudoinstructions are presented in table 6.

Table 6: SSE scalar comparison pseudoinstructions

Pseudoinstruction	operation	instruction
cmpeqss xmm1, xmm2	equal	cmpss xmm1, xmm2, 0
cmpltss xmm1, xmm2	less than	cmpss xmm1, xmm2, 1
cmpless xmm1, xmm2	less or equal	cmpss xmm1, xmm2, 2
cmpunordss xmm1, xmm2	unordered	cmpss xmm1, xmm2, 3
cmpneqss xmm1, xmm2	not equal	cmpss xmm1, xmm2, 4
cmpnltss xmm1, xmm2	not less than	cmpss xmm1, xmm2, 5
cmpnless xmm1, xmm2	not less or equal	cmpss xmm1, xmm2, 6
cmpordss xmm1, xmm2	ordered	cmpss xmm1, xmm2, 7

The comiss instruction compares scalars and sets the flags in the EFLAGS register directly according to the result of the comparison.
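
The eight predicates selected by the immediate can be modelled as below. The Python sketch is illustrative: it implements the packed form on lists of four floats and returns one 32-bit mask per element. The "unordered" case means that at least one operand is NaN; note that the negated predicates are true for unordered operands.

```python
import math

def unordered(a, b):
    return math.isnan(a) or math.isnan(b)

PREDICATES = {
    0: lambda a, b: a == b,                 # equal
    1: lambda a, b: a < b,                  # less than
    2: lambda a, b: a <= b,                 # less or equal
    3: unordered,                           # unordered
    4: lambda a, b: not (a == b),           # not equal
    5: lambda a, b: not (a < b),            # not less than
    6: lambda a, b: not (a <= b),           # not less or equal
    7: lambda a, b: not unordered(a, b),    # ordered
}

def cmpps(dst, src, imm8):
    """Return a 32-bit all-ones or all-zeros mask for each element."""
    predicate = PREDICATES[imm8 & 7]
    return [0xFFFFFFFF if predicate(x, y) else 0 for x, y in zip(dst, src)]

nan = float('nan')
d = [1.0, 2.0, nan, 4.0]
s = [1.0, 3.0, 1.0, nan]
less = cmpps(d, s, 1)       # [0, 0xFFFFFFFF, 0, 0]
not_less = cmpps(d, s, 5)   # [0xFFFFFFFF, 0, 0xFFFFFFFF, 0xFFFFFFFF]
```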

Logical instructions

There are four logical instructions which operate on all 128 bits of the XMM register: andps, andnps, orps, and xorps. They perform bitwise AND, AND NOT, OR, and XOR, respectively.

Data shuffle

Two unpack instructions, unpcklps and unpckhps, operate similarly to the unpack instructions already known from MMX. Because the source and destination data are packed single-precision floating-point values, these instructions, unlike in MMX, are not used to form longer data types; instead, they interleave two elements from each of the two vectors. This is presented in figure 13.

Figure 13: The illustration of SSE unpacking single-precision floating-point values

A more universal instruction is shufps. It selects two of the four single-precision values from the destination (first) operand and writes them to the lower half of the result. The upper half is filled with two single-precision values selected from the source (second) operand. The result replaces the destination operand. Which values are taken is determined by the third operand, an 8-bit immediate. Each two-bit field of the immediate holds the index of a packed single value: 11 selects X3 or Y3, 10 selects X2 or Y2, 01 selects X1 or Y1, and 00 selects X0 or Y0. This is presented in figure 14.

Figure 14: The illustration of SSE shuffle single-precision floating-point values
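
The selection rule can be written down as a short Python model. The sketch follows the operand roles given in Intel's manuals, where the two lower result elements are picked from the destination (first) operand and the two upper ones from the source (second) operand. Lists hold four values with element 0 first, and the sample values are arbitrary.

```python
def shufps(dst, src, imm8):
    """Each 2-bit field of imm8 is an element index, lowest field first."""
    return [
        dst[imm8 & 3],          # result element 0, from the destination
        dst[(imm8 >> 2) & 3],   # result element 1, from the destination
        src[(imm8 >> 4) & 3],   # result element 2, from the source
        src[(imm8 >> 6) & 3],   # result element 3, from the source
    ]

x = [10.0, 11.0, 12.0, 13.0]
y = [20.0, 21.0, 22.0, 23.0]
mixed = shufps(x, y, 0b11_01_10_00)      # [10.0, 12.0, 21.0, 23.0]
reversed_x = shufps(x, x, 0b00_01_10_11) # [13.0, 12.0, 11.0, 10.0]
```

Using the same register for both operands, as in the second call, turns the instruction into an arbitrary permutation of one vector.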

Other instructions

Together with the new data registers, an additional control register appeared in the processor. It is named MXCSR, and the meaning of its bits is similar to that of the FPU control and status registers. Two new instructions handle it: stmxcsr saves MXCSR to memory, and ldmxcsr restores it from memory. The fxsave instruction stores the state of both the x87 unit and the SSE extension, and fxrstor restores it. Some additional MMX instructions were introduced together with the SSE extension. SSE also implemented the first set of data conversion instructions. A summary of all these instructions is presented in the following chapter. Cache support instructions were added as well; they are described in the chapter on optimisation.

SSE2

The SSE2 instruction set implements integer vector operations using the XMM registers. In general, the same instruction mnemonics defined for MMX can be used with XMM registers, which offer vectors twice as long. Additionally, floating-point calculations are complemented with vector operations on the double-precision data type: an XMM register can hold a vector of two double-precision values. SSE2 uses the same software environment (eight 128-bit XMM registers) as SSE. The SSE2 extension also introduced the denormals-are-zeros mode. When this mode is enabled, the processor automatically treats all denormal floating-point arguments as zero. In such a case, the flag indicating a denormal argument is not set, and an exception is not raised. The denormals-are-zeros mode is not compliant with IEEE Standard 754, but it allows faster algorithms for advanced audio and video processing. The arithmetic instructions are similar to those of SSE, but they have the suffix pd (packed double) or sd (scalar double) instead of ps and ss, respectively.

Conversion

In figure 15 we present the type conversion instructions. They enable conversion between integer and floating-point data of various sizes and in different registers. The green arrows and instruction nodes represent the conversion from single-precision floating-point to integers, pink represents the conversion from double-precision floating-point to integers, blue represents the conversion from integers to floating points, and orange represents the conversion between single and double precision floating-point.

SSE3

The SSE3 is a set of 13 instructions. The main innovation in SSE3 is the implementation of horizontal instructions. These instructions perform calculations on the elements of a vector within the same register. There are four such instructions. The haddpd performs horizontal addition of double-precision values, the hsubpd is a horizontal subtraction of double-precision values, the haddps is a horizontal addition of single-precision values, and the hsubps is a horizontal subtraction of single-precision values. All horizontal instructions operate in a similar manner. The lower (bottom) part of the resulting vector is the result of operation on the bottom and top elements of the first (destination) operand; the higher (top) part of the resulting vector is the result of operation on the second (source) operand's bottom and top. The best way to present the principles of horizontal operations is a picture. Because in the subtraction operation the order of arguments is important, the hsubpd instruction is shown in figure 16.

Figure 16: The illustration of a horizontal subtraction instruction

When the source vectors have more than two elements, as in the hsubps instruction, it is also important to know the order of the elements in the resulting vector. It is shown in figure 17.

Figure 17: The illustration of a horizontal single precision subtraction instruction
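
The order of operands and results in horizontal instructions can be summarised with a Python model. The sketch assumes lists with element 0 first; the lower half of the result comes from the destination operand and the upper half from the source, and in the subtraction the higher element of each pair is subtracted from the lower one.

```python
def haddps(dst, src):
    return [dst[0] + dst[1], dst[2] + dst[3], src[0] + src[1], src[2] + src[3]]

def hsubps(dst, src):
    return [dst[0] - dst[1], dst[2] - dst[3], src[0] - src[1], src[2] - src[3]]

def hsubpd(dst, src):
    """Double-precision version: one pair per operand."""
    return [dst[0] - dst[1], src[0] - src[1]]

d = [8.0, 3.0, 5.0, 1.0]
s = [100.0, 1.0, 50.0, 2.0]
sums = haddps(d, s)          # [11.0, 6.0, 101.0, 52.0]
differences = hsubps(d, s)   # [5.0, 4.0, 99.0, 48.0]
```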

SSSE3

This abbreviation stands for Supplemental Streaming SIMD Extensions 3. It is a set of 16 instructions introduced in the Core 2 architecture. It implements integer horizontal operations on XMM registers. The principles are the same as for the horizontal instructions in SSE3, but the instructions process vectors of doublewords or words. They are summarised in table 7.

Table 7: SSSE3 horizontal integer instructions

Instruction	operation	data
phaddd	addition	doublewords (wraparound)
phaddw	addition	words (wraparound)
phaddsw	saturated addition	signed words
phsubd	subtraction	doublewords (wraparound)
phsubw	subtraction	words (wraparound)
phsubsw	saturated subtraction	signed words

Two data shuffle instructions are worth mentioning. The pshufb instruction makes copies of bytes from the first 128-bit operand based on the control information taken from the second 128-bit operand. Each byte in the control operand determines the resulting byte in the respective position.

bit 7 is 1 - byte is cleared
bit 7 is 0 - byte contains a copy of the source byte
bits 0-3 - the index of the source byte to be copied

The illustration is shown in figure 18.

Figure 18: The illustration of a byte shuffle instruction
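
The control rules listed above translate directly into a short Python model. The sketch is illustrative only; both operands are lists of sixteen bytes with element 0 first, and the sample data is arbitrary.

```python
def pshufb(dst, ctrl):
    """Build each result byte from the control byte at the same position."""
    return [0 if c & 0x80 else dst[c & 0x0F] for c in ctrl]

data = list(range(0x10, 0x20))                    # bytes 0x10 .. 0x1F
reverse = pshufb(data, [15 - i for i in range(16)])
broadcast = pshufb(data, [3] * 16)                # sixteen copies of 0x13
cleared = pshufb(data, [0x80] + [1] * 15)         # first byte zeroed
```

With suitable control vectors, the same instruction reverses a vector, broadcasts one byte, or clears selected positions.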

The palignr instruction combines bytes from two source operands as shown in figure 19. The position of the byte split is specified by the third operand, an immediate. In the figure, the immediate is equal to 2.

Figure 19: The illustration of an aligned byte combine instruction
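
A Python model of this byte combination is sketched below. It assumes the behaviour described in Intel's manuals: the destination is placed above the source to form a 32-byte value, which is shifted right by the given number of bytes, and the lowest sixteen bytes are kept. Operands are lists of sixteen bytes with element 0 first.

```python
def palignr(dst, src, imm8):
    """Concatenate dst:src, shift right by imm8 bytes, keep 16 bytes."""
    joined = src + dst + [0] * 16   # src is the low part; zeros shift in
    n = min(imm8, 32)
    return joined[n:n + 16]

dst = list(range(0x20, 0x30))
src = list(range(0x10, 0x20))
combined = palignr(dst, src, 2)
# combined == [0x12, ..., 0x1F, 0x20, 0x21]
```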

SSE4

The SSE4 is composed of SSE4.1 and SSE4.2. These groups include instructions supplementing previous extensions. For example, there are eight instructions which expand support for packed integer minimum and maximum determination, or twelve instructions which improve packed integer format conversions with sign extension and zero extension. The dpps and dppd instructions calculate the dot product of four single-precision and two double-precision operands, respectively. Additionally, the arguments are controlled with the third immediate operand. The example showing the dppd is presented in figure 20.

Figure 20: The illustration of a dot product calculation instruction
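
The role of the immediate in the dot product instructions can be sketched in Python. In this model, which assumes the encoding from Intel's manuals, the upper four bits of the immediate choose which products enter the sum and the lower bits choose which result elements receive it; the remaining elements are set to zero. The single function below (a hypothetical helper) covers both dpps (four elements) and dppd (two elements) and ignores rounding details of the real hardware.

```python
def dot_product(dst, src, imm8):
    """Model of dpps (len 4) or dppd (len 2) on lists, element 0 first."""
    total = sum((x * y for i, (x, y) in enumerate(zip(dst, src))
                 if imm8 & (0x10 << i)), 0.0)
    return [total if imm8 & (1 << i) else 0.0 for i in range(len(dst))]

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
full = dot_product(a, b, 0xF1)      # [70.0, 0.0, 0.0, 0.0]
partial = dot_product(a, b, 0x7F)   # [38.0, 38.0, 38.0, 38.0]
double = dot_product([1.5, 2.0], [4.0, 3.0], 0x31)   # [12.0, 0.0]
```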

There are also advanced shuffle, insert and extract instructions that make it possible to manipulate the positions of data of various types. The insertps instruction inserts a scalar single-precision floating-point value; the positions of the element in the source and destination vectors are controlled with an 8-bit immediate. An example of the insertps instruction is presented in figure 21. In this example, the immediate has the binary value 10011000b.

Figure 21: The illustration of an example of an advanced shuffle instruction
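
The fields of the insertps immediate can be decoded with a short Python model. The sketch assumes the register-source form described in Intel's manuals: bits 7-6 select the source element, bits 5-4 select the destination position, and bits 3-0 form a mask of result elements to be zeroed. Lists hold four values with element 0 first, and the sample values are arbitrary.

```python
def insertps(dst, src, imm8):
    result = list(dst)
    result[(imm8 >> 4) & 3] = src[(imm8 >> 6) & 3]   # insert the element
    return [0.0 if imm8 & (1 << i) else v            # apply the zero mask
            for i, v in enumerate(result)]

d = [1.0, 2.0, 3.0, 4.0]
s = [10.0, 20.0, 30.0, 40.0]
inserted = insertps(d, s, 0b10011000)   # [1.0, 30.0, 3.0, 0.0]
```

With the immediate 10011000b, source element 2 is written to destination position 1, and result element 3 is cleared.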

In SSE4.2, a set of string compare instructions was added. As an XMM register can contain sixteen bytes, it is much more efficient to implement string processing algorithms with XMM registers than with the string instructions of the main processor. There are four string compare instructions (see table 8), and each of them can be configured to achieve different functionality. The length of the strings can be explicit or implicit. Explicit length means that the length of the first operand is specified in the RAX register, and the length of the second operand in the RDX register. Implicit length means that both operands contain null-terminated strings. The instructions can produce two kinds of results. Index means that the index of the first or last matching element is returned. Mask means that either a bit mask is returned (one bit for each pair of compared elements) or a mask with fields of the size of the elements (similarly to the MMX compare instructions).

Table 8: SSE4.2 string compare instructions

Instruction	length	type of the result
pcmpestri	explicit	index
pcmpestrm	explicit	mask
pcmpistri	implicit	index
pcmpistrm	implicit	mask

The third, immediate operand encodes the comparison method and result encoding.

Table 9: SSE4.2 string compare input data

bits 1:0	data type
00	unsigned BYTE
01	unsigned WORD
10	signed BYTE
11	signed WORD

Table 10: SSE4.2 string compare method encoding

bits 3:2	operation	comment
00	Equal Any	find any of the specified characters in the input string
01	Ranges	check if characters are within the specified ranges
10	Equal Each	check if the input strings are equal
11	Equal Ordered	check if the needle string is in the haystack string
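
As a rough illustration of how these fields work together, the Python sketch below models one configuration only: implicit length, unsigned bytes, the Equal Any method, and an index result reporting the first match (an immediate of 0). The function name and the use of bytes objects are assumptions of this sketch. The first operand is treated as the set of characters and the second as the string being searched; a result of 16 means that no match was found.

```python
def pcmpistri_equal_any(char_set, text):
    """Index of the first byte of text that occurs in char_set, else 16."""
    def valid_part(operand):
        # Implicit length: at most 16 bytes, ending at the first zero byte.
        operand = operand[:16]
        end = operand.find(0)
        return operand if end < 0 else operand[:end]

    chars = valid_part(char_set)
    for index, byte in enumerate(valid_part(text)):
        if byte in chars:
            return index
    return 16

first_vowel = pcmpistri_equal_any(b"aeiou", b"rhythm and")   # 7
no_match = pcmpistri_equal_any(b"xyz", b"hello")             # 16
```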

The SSE4.2 string compare instructions are an advanced, powerful means of processing byte or word strings. A detailed explanation of their behaviour, together with illustrations, can be found at ^[1].

AVX

AVX is the abbreviation of Advanced Vector Extensions. AVX implements larger, 256-bit YMM registers as extensions of the XMM registers. In 64-bit mode, the number of YMM registers is increased to 16. Many SSE instructions are extended to handle the new, bigger data types; their mnemonics stay the same apart from the added v prefix (for example, vaddps). The most important improvement in the instruction set of x64 processors is the implementation of RISC-like instructions in which the destination operand can differ from the two source operands. This three-operand SIMD instruction format is encoded with the VEX coding scheme. The AVX2 extension implements more SIMD instructions operating on 256-bit registers. AVX-512 extends the register size to 512 bits. An interesting, comprehensive description of a variety of x64 AVX instructions is available on the website ^[2].

^[1] https://www.officedaytime.com/simd512e/simdimg/str.php?f=pcmpestri

^[2] https://www.officedaytime.com/simd512e/
