Instruction Set of x86 - Essentials

Instruction groups

The x64 processors can execute an extensive number of different instructions. In the documentation of processors, we can find several ways of dividing all instructions into groups. The most general division, according to AMD, defines five groups of instructions:

Intel defines the following groups of instructions.

There is also a long list of extensions defined, including SSE4.1, SSE4.2, Intel AVX, AMD 3DNow! and many others. For a detailed description of instruction groups, please refer to

Details of every instruction can be found in the description of the instruction set

There are also specialised websites with detailed explanations of instructions that you can use to get a lot of additional information. Among others, you can visit:

In this book, we will present most of the general-purpose instructions and provide general ideas on the chosen extensions, including FPU, MMX, SSE, and AVX.

General Purpose Instructions

General-purpose instructions can be divided into some subgroups.

Condition Codes

Before describing instructions, let's present the condition codes. The condition code takes the form of a suffix to the instruction and influences its behaviour in such a way that if the condition is met, the instruction is executed; if the condition is not met, the processor moves on to the next instruction in the program. The condition that is checked during the execution of the conditional instruction is based on the current state of the flags in the EFLAGS register. The flags in the EFLAGS register are modified by instructions, mainly arithmetic, logical, shift, or special flag manipulation instructions. It is important to note that flags are not modified when copying data, so to check whether the value just read is zero, you should perform, for example, a comparison. Condition codes together with flags checked are presented in table 1.

Table 1: Condition codes
Condition code (cc)   Flags checked      Comment
E                     ZF=1               Equal
Z                     ZF=1               Zero
NE                    ZF=0               Not equal
NZ                    ZF=0               Not zero
A                     CF=0 and ZF=0      Above
NBE                   CF=0 and ZF=0      Not below or equal
AE                    CF=0               Above or equal
NB                    CF=0               Not below
B                     CF=1               Below
NAE                   CF=1               Not above or equal
BE                    CF=1 or ZF=1       Below or equal
NA                    CF=1 or ZF=1       Not above
G                     ZF=0 and SF=OF     Greater
NLE                   ZF=0 and SF=OF     Not less or equal
GE                    SF=OF              Greater or equal
NL                    SF=OF              Not less
L                     SF<>OF             Less
NGE                   SF<>OF             Not greater or equal
LE                    ZF=1 or SF<>OF     Less or equal
NG                    ZF=1 or SF<>OF     Not greater
C                     CF=1               Carry
NC                    CF=0               Not carry
O                     OF=1               Overflow
NO                    OF=0               Not overflow
S                     SF=1               Sign (negative)
NS                    SF=0               Not sign (non-negative)
P                     PF=1               Parity
PE                    PF=1               Parity even
NP                    PF=0               Not parity
PO                    PF=0               Parity odd
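
The "above/below" codes are meant for unsigned values and the "greater/less" codes for signed values. The following minimal C sketch models an 8-bit comparison and evaluates three of the conditions from table 1. It is an illustration only: the flags8 type and all function names are hypothetical, and the flag formulas are the usual ones for a subtraction, not something defined in this text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the four flags produced by an 8-bit "cmp a, b". */
typedef struct { bool cf, zf, sf, of; } flags8;

flags8 cmp8(uint8_t a, uint8_t b)
{
    uint8_t r = (uint8_t)(a - b);               /* cmp discards this result */
    flags8 f;
    f.cf = a < b;                               /* borrow: unsigned a below b */
    f.zf = (r == 0);                            /* operands are equal */
    f.sf = (r & 0x80) != 0;                     /* copy of the result's MSB */
    f.of = (((a ^ b) & (a ^ r)) & 0x80) != 0;   /* signed overflow */
    return f;
}

bool cond_a(flags8 f) { return !f.cf && !f.zf; }        /* A: above (unsigned) */
bool cond_g(flags8 f) { return !f.zf && f.sf == f.of; } /* G: greater (signed) */
bool cond_l(flags8 f) { return f.sf != f.of; }          /* L: less (signed) */
```

For the byte values 0xFF and 0x01, the model reports "above" (255 > 1 as unsigned) and at the same time "less" (-1 < 1 as signed), which is why the condition code must match the interpretation of the data.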

Data transfer instructions

Almost all assembler tutorials start with the presentation of the mov instruction, which is used to copy data from the source operand to the destination operand. Our book is not an exception, and we've already shown this instruction in examples presented in previous sections.

MOV

Let's look at some additional variants.

mov al, bl         ;copy one byte from bl to al
mov ax, bx         ;copy word (two bytes) from bx to ax
mov eax, ebx       ;copy doubleword (four bytes) from ebx to eax
mov rax, rbx       ;copy quadword (eight bytes) from rbx to rax

In the mov instruction, the size of the source argument must be the same as the size of the destination argument. Arguments can be stored in registers or in memory addressed directly or indirectly, and the source argument can also be a constant (immediate). Only one memory argument is allowed. This comes from the instruction encoding, which has room for only one direct or indirect memory argument. That's why most instructions, not only mov, can operate with one memory argument only. There are some exceptions, for example, string instructions, but such instructions use specific indirect addressing.

mov al, 100        ;0xB0, 0x64         copy constant (immediate) of the value 100 (0x64) to al
mov al, [bx]       ;0x67, 0x8A, 0x07   copy byte from the memory at address stored in bx to al (indirect addressing)
 
;Notice the difference between two following instructions
mov eax, 100       ;0xB8, 0x64, 0x00, 0x00, 0x00   copy constant 100 to eax
mov eax, [100]     ;0xA1, 0x64, 0x00, 0x00, 0x00   copy value from memory at address 100
 
;It is possible to copy a constant to memory addressed directly or indirectly
;operand size specifier dword ptr is required to tell the assembler the size of the argument
mov dword ptr ds:[200], 100   ;0xC7, 0x05, 0xC8, 0x00, 0x00, 0x00, 0x64, 0x00, 0x00, 0x00
                              ;copy value of 100, encoded as dword (four bytes), 0x64 = 100
                              ;to memory at address 200, encoded as four bytes,  0xC8 = 200
 
mov dword ptr [ebx], 100      ;0xC7, 0x03, 0x64, 0x00, 0x00, 0x00
                              ;copy value of 100, encoded as dword (four bytes), 0x64 = 100
                              ;to memory addressed by ebx 

Conditional move

Starting from the P6 machines, the conditional move instruction cmovcc was introduced. This works similarly to mov, but copies data if the specified condition is true. The condition code is one of the codes presented in the section “Condition Codes”. If the condition is false, the instruction simply passes through without modifying the arguments. Conditional move instructions can be used to avoid conditional jumps. For example, if we need to copy data from ebx to ecx, if the result of the previous operation is negative, we can write the following instruction.

cmovs ecx, ebx

Sign extension

In the situation of copying data of a smaller size (expressed in number of bits) to a bigger destination argument, the question arises as to what to do with the remaining bits. Let us consider copying an 8-bit value from bl to the 16-bit ax register. If the value copied is unsigned or positive (let it be 5), the remaining bits should be cleared.

              ;   ah      al
mov al, bl    ;        00000101   = 5 in al
mov ah, 0     ;00000000
              ;0000000000000101   = 5 in ax

If the value is negative (e.g. -5) the situation changes.

              ;   ah      al
mov al, bl    ;        11111011   = -5 in al
mov ah, 0     ;00000000
              ;0000000011111011   = 251 in ax

It is visible that to preserve the original value, the upper bits must be filled with ones, not zeros.

              ;   ah      al
mov al, bl    ;        11111011   = -5 in al
mov ah, 0xFF  ;11111111
              ;1111111111111011   = -5 in ax

There are special instructions which perform automatic sign extension, copying the sign bit to all higher bit positions, for example cbw, cwde and cdqe. They can be considered as type conversion instructions. These instructions do not have any arguments as they operate on the accumulator only.

Fortunately, there are also more universal instructions which copy and extend data at the same time, such as movsx (sign extension) and movzx (zero extension).
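
The difference between the two kinds of extension can be modelled in a few lines of C. This is only a sketch of the bit manipulation described above, with hypothetical function names; it does not show how the processor implements the instructions.

```c
#include <stdint.h>

/* Zero extension: the upper byte is cleared (the "mov ah, 0" case). */
uint16_t zero_extend8(uint8_t v)
{
    return (uint16_t)v;
}

/* Sign extension: bit 7 is copied into bits 8..15 (the "mov ah, 0xFF" case
   when the value is negative). */
uint16_t sign_extend8(uint8_t v)
{
    return (v & 0x80) ? (uint16_t)(0xFF00u | v) : (uint16_t)v;
}
```

With the byte 11111011 (0xFB) from the example above, zero_extend8 gives 0x00FB (251), while sign_extend8 gives 0xFFFB, which is -5 as a 16-bit signed value.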

Exchange instructions

The exchange instructions swap the values of operands. A single exchange instruction can replace three mov instructions when swapping the contents of two arguments, so it can be useful in optimising some algorithms. Exchange instructions are also helpful in the implementation of semaphores, even in multiprocessor systems.

The xchg instruction swaps the values of two arguments. If one of the arguments is in memory, the instruction behaves as if it had the LOCK prefix, allowing for semaphore implementation.

The cmpxchg instruction works with three values: the source, the destination, and the accumulator, which is an implicit argument. It compares the destination argument with the accumulator; if they are equal, the destination argument value is replaced with the value from the source operand and the zero flag is set. If they differ, the destination value is loaded into the accumulator and the zero flag is cleared. It is used to test and modify semaphores. Its operation is presented in figure 1. In newer machines, the eight- and sixteen-byte versions were added: cmpxchg8b and cmpxchg16b. They always use ECX:EBX or RCX:RBX as the source argument and EDX:EAX or RDX:RAX as the accumulator. The destination argument is in memory.

Figure 1: Explanation of cmpxchg instruction
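
The compare-and-exchange step can be written down as a short C model. The sketch below is deliberately not atomic and only mirrors the data flow of the 32-bit form; the function name and the way the zero flag is returned are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Non-atomic model of "cmpxchg dest, src" with a 32-bit accumulator.
   The return value plays the role of the zero flag. */
bool cmpxchg32_model(uint32_t *dest, uint32_t src, uint32_t *acc)
{
    if (*dest == *acc) {
        *dest = src;    /* equal: the source is stored in the destination */
        return true;    /* ZF = 1 */
    }
    *acc = *dest;       /* different: the accumulator receives the destination */
    return false;       /* ZF = 0 */
}
```

Used as a simple lock, the destination holds 0 when the lock is free: a caller sets the accumulator to 0 and the source to 1, and owns the lock only if the function returns true. In real C programs the atomic counterpart is atomic_compare_exchange_strong from <stdatomic.h>, which has the same shape.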

The xadd instruction exchanges the two arguments, adds them, and stores the sum in the destination argument, so the source argument ends up holding the original value of the destination. Together with a LOCK prefix, it can be used to implement a DO loop executed by more than one processor simultaneously.
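
A minimal, non-atomic C model of this exchange-and-add step is shown below. The function name is hypothetical; the point is only to show which value ends up where.

```c
#include <stdint.h>

/* Non-atomic model of "xadd dest, src" for 32-bit operands. */
void xadd32_model(uint32_t *dest, uint32_t *src)
{
    uint32_t old = *dest;   /* remember the original destination */
    *dest = old + *src;     /* the destination receives the sum */
    *src = old;             /* the source receives the old destination */
}
```

When several processors share a loop counter in the destination, each of them gets a different "old" value back in its source register, which is what makes the locked form useful for distributing loop iterations.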

The bswap instruction is a single-argument instruction; it changes the order of bytes in a 32- or 64-bit register. It can be used to convert little-endian data to big-endian representation and vice versa, as shown in figure 2.

Figure 2: Explanation of bswap instruction in 32-bit mode
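
The byte reordering can be expressed with shifts and masks. The following C function is a small model of the 32-bit case, written for illustration with a hypothetical name.

```c
#include <stdint.h>

/* Model of bswap on a 32-bit register: byte 0 <-> byte 3, byte 1 <-> byte 2. */
uint32_t bswap32_model(uint32_t v)
{
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}
```

Applying the function twice returns the original value, so the same operation converts in both directions between little-endian and big-endian.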

Stack instructions

A stack is a special structure in memory that automatically stores the return address (the address of the next instruction) when a procedure is called (it is described in detail in the section about the call instruction). It is also possible to use the stack for local variables in functions, to pass arguments to procedures, and for temporary data storage. In x86 architecture, the stack is supported by hardware with the special stack pointer register. Instructions operating on the stack automatically modify the stack pointer so that it always points to the top of the stack. The push instruction decrements the stack pointer and places the data onto the stack. As a result, the stack pointer points to the last data on the stack. It is shown in figure 3.

Figure 3: Explanation of push instruction

The pop instruction takes data off the stack, copies it into the destination argument, and increments the stack pointer. After its execution, the stack pointer points to the previous data stored on the stack. It is shown in figure 4.

Figure 4: Explanation of pop instruction
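
The order of operations in push and pop can be modelled with a small array. In this C sketch the stack area, its size of 16 quadwords and all names are assumptions chosen for illustration, and there are no overflow or underflow checks.

```c
#include <stdint.h>

typedef struct {
    uint64_t mem[16];   /* tiny stack area, one slot per quadword */
    unsigned sp;        /* index of the top element; 16 means "empty" */
} stack_model;

void stack_init(stack_model *s)
{
    s->sp = 16;         /* the stack grows towards lower indexes */
}

void push64(stack_model *s, uint64_t v)
{
    s->sp -= 1;         /* first decrement the stack pointer */
    s->mem[s->sp] = v;  /* then store the data at the new top */
}

uint64_t pop64(stack_model *s)
{
    uint64_t v = s->mem[s->sp];  /* first read the data from the top */
    s->sp += 1;                  /* then increment the stack pointer */
    return v;
}
```

Values come back in the reverse order of pushing, so registers saved with a sequence of push instructions must be restored with pop instructions in the opposite order.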

There are also instructions that push or pop all eight general-purpose registers (including the stack pointer). The 16-bit registers are pushed with pusha and popped with popa instructions. For 32-bit registers, the pushad and popad instructions can be used, respectively. The order of registers on the stack is shown in figure 5. These instructions are not supported in 64-bit mode.

Figure 5: Explanation of pushad and popad instructions

Arithmetic instructions

Arithmetic instructions perform calculations on binary encoded data. It is worth noting that the processor does not distinguish between unsigned and signed values; it is the responsibility of the programmer to provide correct input values and properly interpret the results obtained.

There are instructions which support decimal arithmetic, but due to the rare use of BCD numbers in modern software, they are not available in x64 mode.

Addition and subtraction

There are two adding instructions. The add instruction adds two values from the destination and source arguments and stores the result in the destination argument. It modifies the flags in the EFLAGS register according to the result. The adc instruction additionally adds “1” if the carry flag (CF) is set. It allows the processor to calculate the sum of values bigger than can be encoded in a register (for example, 128-bit integers in a 64-bit processor). Similarly, there are two subtraction instructions. The sub instruction subtracts the source argument from the destination argument, stores the result in the destination, and modifies the flags according to the result. The sbb instruction calculates the difference of arguments minus “1” if the CF flag is set (here, CF plays the role of the borrow flag).
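
The way add and adc, or sub and sbb, cooperate on wide numbers can be sketched in C with two 64-bit halves. The u128_model type and the function names are hypothetical, and the carry is computed explicitly because C has no access to the carry flag.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128_model;

/* 128-bit addition: "add" on the low halves, "adc" on the high halves. */
u128_model add128(u128_model a, u128_model b)
{
    u128_model r;
    r.lo = a.lo + b.lo;                  /* add: may wrap around */
    uint64_t carry = (r.lo < a.lo);      /* CF produced by the add */
    r.hi = a.hi + b.hi + carry;          /* adc: adds the carry as well */
    return r;
}

/* 128-bit subtraction: "sub" on the low halves, "sbb" on the high halves. */
u128_model sub128(u128_model a, u128_model b)
{
    u128_model r;
    r.lo = a.lo - b.lo;                  /* sub: may wrap around */
    uint64_t borrow = (a.lo < b.lo);     /* CF acting as the borrow flag */
    r.hi = a.hi - b.hi - borrow;         /* sbb: subtracts the borrow as well */
    return r;
}
```

Adding 1 to a number whose low half is all ones clears the low half and increments the high half, which is exactly the carry that adc picks up.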

Incrementation and decrementation

The inc instruction adds “1” to the argument, and the dec instruction subtracts “1” from it. Unlike add and sub, these instructions leave the carry flag (CF) unchanged.

Multiply

Two multiply instructions are implemented. The mul is a one-argument instruction. It multiplies the content of the argument and the accumulator, treated as unsigned numbers. The size of the accumulator corresponds to the size of the argument. As the product can be up to twice as wide as the input values, the result is stored in a wider accumulator or in a register pair, as shown in table 2.

Table 2: Multiply instruction argument and result size
Argument   Accumulator   Result
8 bits     AL            AX
16 bits    AX            DX:AX
32 bits    EAX           EDX:EAX
64 bits    RAX           RDX:RAX
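
The 32-bit row of table 2 can be modelled in C by computing the product in a 64-bit variable and splitting it into two halves. The function below is a hypothetical sketch of that split, not a description of the hardware.

```c
#include <stdint.h>

/* Model of "mul r/m32": EDX:EAX = EAX * operand (unsigned). */
void mul32_model(uint32_t eax_in, uint32_t operand,
                 uint32_t *edx, uint32_t *eax)
{
    uint64_t product = (uint64_t)eax_in * operand;  /* full 64-bit product */
    *eax = (uint32_t)product;                       /* lower half */
    *edx = (uint32_t)(product >> 32);               /* upper half */
}
```

Multiplying 0xFFFFFFFF by itself leaves 0xFFFFFFFE in EDX and 1 in EAX, which shows that the upper half cannot be ignored for large operands.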

The imul instruction implements the signed multiply. It can have one, two or three arguments. The single-argument version behaves the same way as the mul instruction, but treats the values as signed numbers. The two-argument version multiplies the 16-, 32-, or 64-bit register used as the destination operand by the argument of the same size. The three-argument version multiplies the content of the source argument by the immediate and stores the result in the destination of the same size as the arguments. The destination must be a register. In the two- and three-argument versions, only the lower half of the product is kept.

Divide

Two divide instructions are implemented. The div is a one-argument instruction. It divides the content of the accumulator by the argument, treated as unsigned numbers. The size of the accumulator is twice as big as the size of the argument. The result is stored as two integer values of the same size as the argument. The quotient is placed in the lower half of the accumulator, and the remainder in the higher half of the accumulator. Depending on the size of the argument, the accumulator is understood as AX or as a pair of registers DX:AX, EDX:EAX or RDX:RAX, as shown in table 3.

Table 3: Divide instruction arguments and results size
Argument   Accumulator   Quotient   Remainder
8 bits     AX            AL         AH
16 bits    DX:AX         AX         DX
32 bits    EDX:EAX       EAX        EDX
64 bits    RDX:RAX       RAX        RDX
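
The 32-bit row of table 3 is modelled below in C. The function name and the Boolean return value are assumptions made for this sketch: the model returns false where the real processor raises a divide error exception, that is, for a zero divisor or a quotient that does not fit in EAX.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of "div r/m32": divides EDX:EAX by the operand (unsigned).
   On success EAX holds the quotient and EDX the remainder. */
bool div32_model(uint32_t *edx, uint32_t *eax, uint32_t divisor)
{
    uint64_t dividend = ((uint64_t)*edx << 32) | *eax;
    if (divisor == 0)
        return false;                    /* division by zero */
    uint64_t quotient = dividend / divisor;
    if (quotient > 0xFFFFFFFFu)
        return false;                    /* quotient does not fit in EAX */
    *edx = (uint32_t)(dividend % divisor);
    *eax = (uint32_t)quotient;
    return true;
}
```

The second failure case explains why the upper half of the accumulator must be prepared before every division: a stale value in EDX easily makes the quotient too large.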

The idiv instruction implements the signed divide. It behaves the same way as the div instruction except for the type of numbers.

Logical instructions

The set of logical instructions contains and, or, xor and not instructions. All of them perform bitwise Boolean operations corresponding to their names. The not is a single-argument instruction; others have two arguments.

Shift and rotate instructions

Shift and rotate instructions treat the argument as a shift register. Each bit of the argument is moved to the neighbouring position on the left or right, depending on the shift direction. The number of bit positions for the shift can be specified as a constant or in the CL register. Shift instructions can be used for multiplying (shift left) and dividing (shift right) by a power of two. Shift instructions have two versions: logical and arithmetical. Logical shift left shl and arithmetical shift left sal behave the same, filling the empty bits (at the LSB position) with zeros. Logical shift right shr fills the empty bits (at the MSB position) with zeros, while the arithmetical shift right sar makes a copy of the most significant bit, preserving the sign of a value. It is shown in figure 6.

Figure 6: Explanation of shift instructions
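
The difference between the logical and the arithmetical right shift can be modelled on bytes in C. The function names are hypothetical, the shift count is assumed to be in the range 0 to 7, and the sign bits are filled in explicitly so that the model does not depend on how a C compiler shifts negative numbers.

```c
#include <stdint.h>

/* Model of shr on a byte: vacated MSB positions are filled with zeros. */
uint8_t shr8(uint8_t v, unsigned n)
{
    return (uint8_t)(v >> n);
}

/* Model of sar on a byte: vacated MSB positions are filled with copies
   of the original sign bit. */
uint8_t sar8(uint8_t v, unsigned n)
{
    uint8_t r = (uint8_t)(v >> n);
    if (v & 0x80)
        r |= (uint8_t)(0xFF << (8 - n));  /* refill the top n bits with ones */
    return r;
}
```

For the byte 0xF0 (-16 as a signed value) and a shift by 2, shr8 gives 0x3C (60), while sar8 gives 0xFC (-4), the correct result of dividing -16 by 4.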

There are two double shift instructions which move bits from the source argument to the destination argument. The number of bits is specified as the third argument. Shift double right has shrd mnemonic, while shift double left has shld mnemonic. The operation of shift double instructions is presented in figure 7.

Figure 7: Explanation of double shift instructions

For all shift instructions, the last bit shifted out is placed in the carry flag.

Rotate instructions shift bits left rol or right ror in the argument, and additionally move bits around from the lowest to the highest or from the highest to the lowest position. Behaviour of rotate instructions is shown in figure 8.

Figure 8: Explanation of rotate instructions

The rotate through carry left rcl and right rcr instructions treat the carry flag as an additional bit while rotating. They can be used to collect bits to form multi-bit data. Behaviour of rotate with carry instructions is shown in figure 9.

Figure 9: Explanation of rotate with carry instructions
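
A one-position rotation of a byte, with and without the carry flag, can be modelled in C as follows. Both function names are hypothetical, and the carry flag is passed as an ordinary Boolean variable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of "rol r8, 1": bit 7 wraps around to bit 0. */
uint8_t rol8_by1(uint8_t v)
{
    return (uint8_t)((v << 1) | (v >> 7));
}

/* Model of "rcl r8, 1": the old CF enters bit 0 and bit 7 becomes
   the new CF, so nine bits take part in the rotation. */
uint8_t rcl8_by1(uint8_t v, bool *cf)
{
    bool out = (v & 0x80) != 0;
    uint8_t r = (uint8_t)((v << 1) | (*cf ? 1 : 0));
    *cf = out;
    return r;
}
```

Calling rcl8_by1 repeatedly, with the carry flag loaded with one input bit each time, collects the bits into a byte, which is the use case mentioned above.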

Bit and Byte Instructions

Bit test instruction bt makes a copy of the selected bit in the carry flag. The bit for testing is specified by a combination of two arguments. The first argument, named the bit base operand, holds the bit. It can be a register or a memory location. The second operand is the bit offset, which specifies the position of the bit operand. It can be a register or an immediate value. It starts counting from 0, so the least significant bit has the position 0. An example of the behaviour of the bt instruction is shown in figure 10.

Figure 10: Explanation of bit test instruction

Bit test and modify instructions first make a copy of the selected bit, and next modify the original bit value with the one specified by the instruction. The bts sets the bit to one, btr clears the bit (resets to zero value), btc changes the state of the bit to the opposite (complements).
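
The four bit test instructions can be modelled in C for a 32-bit register operand. The names are hypothetical, the return value stands for the carry flag, and the bit offset is reduced modulo 32, as happens when the bit base is a register.

```c
#include <stdbool.h>
#include <stdint.h>

/* bt: copy the selected bit to CF (returned), leave the operand unchanged. */
bool bt32(uint32_t v, unsigned i)
{
    return ((v >> (i & 31u)) & 1u) != 0;
}

/* bts: copy the bit to CF, then set it. */
bool bts32(uint32_t *v, unsigned i)
{
    bool cf = bt32(*v, i);
    *v |= (uint32_t)1 << (i & 31u);
    return cf;
}

/* btr: copy the bit to CF, then reset it. */
bool btr32(uint32_t *v, unsigned i)
{
    bool cf = bt32(*v, i);
    *v &= ~((uint32_t)1 << (i & 31u));
    return cf;
}

/* btc: copy the bit to CF, then complement it. */
bool btc32(uint32_t *v, unsigned i)
{
    bool cf = bt32(*v, i);
    *v ^= (uint32_t)1 << (i & 31u);
    return cf;
}
```

In every case the returned value is the state of the bit before the modification.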

The bit scan instructions search for the first occurrence of the bit of the value 1. The bit scan forward bsf scans starting from the least significant bit towards higher bits, bit scan reverse bsr starts from the most significant bit towards lower bits. Both instructions return the index of the found bit in the destination register. If there is no bit of the value 1, the zero flag is set, and the destination register value is undefined.
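
The two scan directions can be modelled in C with simple loops. The function names are hypothetical; a return value of false corresponds to the zero flag being set, in which case the index is left untouched, reflecting that the destination is undefined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of bsf: index of the lowest set bit. */
bool bsf32(uint32_t v, unsigned *index)
{
    if (v == 0)
        return false;               /* no bit set: ZF = 1 */
    unsigned i = 0;
    while ((v & 1u) == 0) {         /* scan from bit 0 upwards */
        v >>= 1;
        i++;
    }
    *index = i;
    return true;
}

/* Model of bsr: index of the highest set bit. */
bool bsr32(uint32_t v, unsigned *index)
{
    if (v == 0)
        return false;               /* no bit set: ZF = 1 */
    unsigned i = 31;
    while ((v & 0x80000000u) == 0) {  /* scan from bit 31 downwards */
        v <<= 1;
        i--;
    }
    *index = i;
    return true;
}
```

For the value 00011000 in binary, bsf32 reports index 3 and bsr32 reports index 4.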

The test instruction performs the logical AND function without storing the result. It just modifies flags according to the result of the AND operation.

The setcc instruction sets the argument to 1 if the chosen condition is met, or clears the argument if the condition is not met. The condition can be freely chosen from the set of conditions available for other instructions, for example, cmovcc. This instruction is useful to convert the result of the operation into the Boolean representation.

The popcnt instruction counts the number of bits equal to “1” in the operand. The applications of this instruction include genome mining, handwriting recognition, digital health workloads, and fast Hamming distance counts[7].
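
The relation between bit counting and the Hamming distance can be shown with a short C model. Both function names are hypothetical, and the loop is only a portable stand-in for what the instruction does in hardware.

```c
#include <stdint.h>

/* Software model of popcnt: counts the bits equal to 1. */
unsigned popcnt64_model(uint64_t v)
{
    unsigned n = 0;
    while (v != 0) {
        v &= v - 1;     /* clears the lowest set bit */
        n++;
    }
    return n;
}

/* Hamming distance: the number of bit positions in which a and b differ. */
unsigned hamming64(uint64_t a, uint64_t b)
{
    return popcnt64_model(a ^ b);
}
```

The xor marks every differing position with a 1, so one xor followed by one bit count gives the distance.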

The crc32 instruction implements the calculation of the cyclic redundancy check in hardware. The polynomial of the value 11EDC6F41h is fixed.
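
For reference, the same checksum can be computed in software. The C sketch below uses the bit-reflected form of the polynomial (0x82F63B78), which is the usual way of modelling the byte variant of the instruction. The initial value of all ones and the final inversion are a common convention of the CRC-32C checksum, assumed here and not part of the instruction itself; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulation step over a single byte, bit by bit. */
uint32_t crc32c_step8(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
    return crc;
}

/* CRC-32C of a buffer, using the conventional initial value and
   final inversion. */
uint32_t crc32c(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)
        crc = crc32c_step8(crc, *p++);
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value of CRC-32C for the nine ASCII characters "123456789" is 0xE3069283, which can be used to verify any implementation.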

Control transfer instructions

Before describing the instructions used for control transfer, we will discuss how the destination address can be calculated. The destination address is the address given to the processor to make a jump to.

Near and far transfer

While the segmentation is enabled, the destination address can be given as the offset only or in full logical form. If there is an offset only, the instruction modifies solely the instruction pointer, the jump is performed within the current segment and is called near. If the address is provided in full logical form, containing segment and offset parts, the CS and IP registers are modified. Such an instruction can perform a jump between segments and is called far.

Absolute and relative address

An absolute address is given as a value specifying the destination address as the number of the byte counted from the beginning of the memory, or, if segmentation is enabled, as the offset from the beginning of the segment. A relative address is calculated as the difference between the absolute destination address and the current value of the instruction pointer, which at that moment already points to the instruction following the jump. It is provided in the instruction as a signed number representing the distance between the current and destination addresses. If it is possible to encode the difference as an 8-bit signed value, the jump is called short. Usually, an assembler automatically chooses the shortest possible encoding.
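
The calculation an assembler performs for a short jump can be sketched in C. The function name is hypothetical, and the sketch assumes the two-byte short form of the jump (one opcode byte and one displacement byte), so the instruction pointer used in the calculation is the address of the jump plus 2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Computes the 8-bit displacement of a short jump located at jump_addr.
   Returns false if the target is out of range for the short form. */
bool short_jump_disp(int64_t jump_addr, int64_t target, int8_t *disp)
{
    int64_t next_ip = jump_addr + 2;     /* address of the next instruction */
    int64_t d = target - next_ip;        /* signed distance to the target */
    if (d < -128 || d > 127)
        return false;                    /* needs a longer encoding */
    *disp = (int8_t)d;
    return true;
}
```

A jump to its own address therefore has the displacement -2, encoded as the byte 0xFE.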

Conditional and unconditional control transfer

Conditional transfer instructions check the state of chosen flags in the Flags register and perform the jump to the specified address if the condition gives a true result. If the condition results in false, the processor goes to the next instruction in the instruction stream. Conditions are specified the same way as in cmovcc instruction as the suffix to the main mnemonic. Unconditional transfer instructions are always executed the same way. They jump to the specified address without any condition checking.

Unconditional control transfer instructions

Unconditional control transfer instructions perform the jump to the new address to change the program flow. The jmp instruction jumps to a destination address by putting the destination address in the instruction pointer register. If segmentation is enabled and the destination address is placed in another segment than the current one, it also modifies the CS register. The call instruction is designed to handle subroutines. It also jumps to a destination address, but before putting the new value into the instruction pointer, it pushes the returning address onto the stack. The returning address is the address of the next instruction after the call. This allows the processor to use the returning address later to get back from the subroutine to the main program. The ret instruction forms a pair with the call. It uses the information stored on the stack to return from a subroutine. The process of calling a procedure and returning to the main program is shown in figure 11.

Figure 11: Explanation of call and ret instructions
In assembler, subroutines are called procedures. In other languages, you can find the names: function (it can return the resulting value), method (in object-oriented languages) or subprogram.

Interrupts

An interrupt mechanism in x86 works with hardware-signalled interrupts or with special interrupt instructions. Return from an interrupt is performed by executing the iret instruction. With a 32-bit operand size, the mnemonic for this instruction is iretd, and with a 64-bit operand size it is iretq. The iret instruction differs from the ret instruction in that it pops from the stack not only the return address but also the content of the Flags register. This keeps the content of this register unmodified after return, and additionally prevents unintentional blocking of subsequent interrupts. The process of interrupt handler calling and returning to the main program is shown in figure 12.

Figure 12: Illustration of interrupt signalling and return from the handler

Software interrupts are handled the same way as signalled by the hardware. The int instruction signals the interrupt of a given number. There are also some special interrupt instructions. The int1 and int3 are one-byte special machine codes used for debugging, into signals a software overflow exception if the OF flag is set, and bound raises the bound range exceeded exception (int 5) when the tested value is over or under the defined bounds. The last two instructions are not valid in 64-bit mode.

In 32 and 64-bit operating systems, the interrupts are handled by the OS and called through the interrupt descriptors, called gates.

Conditional control transfer instructions

The jcc instructions are used to test the state of flags and perform the jump to the destination address if the condition is met. In modern pipelined processors, it is recommended to avoid using conditional jumps if possible, ensuring that the program flows continuously, without the need to invalidate the pipeline. It is important to remember that flags are modified as a result of executing the arithmetic or logic instruction, but not the mov instruction. For example, if we need to test if some variable is zero, we can write such code:

cmp var1, 0     ;compare variable
jz is_zero      ;conditional jump to address is_zero
mov rax, "1"    ;if not zero put ASCII code of "1" in rax
jmp not_zero    ;jump unconditionally over next instruction
is_zero:        ;label to jump to if var1 is zero
mov rax, "0"    ;if zero put ASCII code of "0" in rax
not_zero:       ;label to jump to if var1 is not zero
You can try to optimise this code by avoiding jumps. Try to use the conditional mov instruction.

Loop instructions

The loop instruction is used to implement a loop which is executed a known number of times. The number of iterations should be set before the loop in the counter register (CX/ECX/RCX). The loop instruction automatically decrements the counter register, checks whether it has reached zero and, if not, jumps to the address given as the argument of the instruction, which is assumed to be the beginning of the loop. If the counter reaches zero, execution continues with the next instruction in the stream. There are also conditional versions of the loop instruction, which allow finishing the iteration process before the counter reaches zero. The loope or loopz instructions continue the iteration if the counter is above zero and the zero flag (ZF) is set. The loopne or loopnz instructions continue the iteration if the counter is above zero and the zero flag (ZF) is cleared. The loop instruction can cause the system to iterate many times if the counter register is zero before entering the loop. As the first step is the decrementing of the counter, it will result in a value composed of all “1” bits. For CX, the loop will be executed 65536 times, for ECX more than 4 billion times and for RCX 18 quintillion 446 quadrillion 744 trillion 73 billion 709 million 551 thousand and 616 times! Understandably, we should avoid such a situation. The jcxz, jecxz and jrcxz instructions can help to jump over the entire loop if the counter register is zero at the beginning, as in the following code.

lea rbx, table   ;table with values to sum
mov rcx, size    ;size of a table - we can't ensure it's not zero
xor rdx, rdx     ;zero rdx - it will be the sum of elements
jrcxz end_loop   ;jump over the loop if rcx is zero
begin_loop:
add rdx, [rbx]   ;add the item to the resulting value
add rbx, 8       ;point to the next item (items are 8-byte quadwords)
loop begin_loop  ;loop
end_loop:
The loop instructions are reported to be poorly optimised on many modern pipelined processors, and are often replaced with decrement, compare and conditional jump instructions.
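
The wrap-around of the counter described above can be reproduced with a small C model of a 16-bit loop. The function names are hypothetical; the first function counts how many times the body of the loop runs, and the second one adds the guard that jcxz provides.

```c
#include <stdint.h>

/* Model of "body; loop label" with CX as the counter.
   Returns the number of times the body is executed. */
uint32_t loop16_iterations(uint16_t cx)
{
    uint32_t n = 0;
    do {
        n++;                          /* the loop body runs */
        cx = (uint16_t)(cx - 1);      /* loop: decrement the counter first */
    } while (cx != 0);                /* then jump back if it is not zero */
    return n;
}

/* The same loop guarded by a check equivalent to jcxz. */
uint32_t guarded_loop16_iterations(uint16_t cx)
{
    if (cx == 0)
        return 0;                     /* jump over the entire loop */
    return loop16_iterations(cx);
}
```

With a counter of 0, the unguarded model executes the body 65536 times, while the guarded version skips the loop entirely.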

String Instructions

String instructions are designed to perform operations on elements of data tables, including text strings. These instructions can access two elements in memory: source and destination. If segmentation is enabled, the source operand is identified with SI/ESI and placed by default in the data segment (DS), while the destination operand is identified with DI/EDI and always stored in the extra segment (ES). In 64-bit mode, the source operand is identified with RSI, and the destination operand is identified with RDI. String instructions can operate on bytes, words, doublewords or quadwords. The size of the element is specified as the suffix of the instruction or derived from the size of the arguments specified in the instruction.

String copy

The movs instruction copies the element of the source string to the destination string. It requires two arguments of the size of bytes, words, doublewords or quadwords. The movsb instruction copies a byte from the source string to the destination string. The movsw instruction copies a word from the source string to the destination string. The movsd instruction copies a doubleword from the source string to the destination string. The movsq instruction copies a quadword from the source string to the destination string.

The locations of the source and destination operands are always accessed with the use of the source and destination index registers, which must be loaded correctly before the string instruction is executed. Arguments, if present, are used to determine the size of the element only.

Store string

These instructions store the content of the accumulator to the destination operand. The stos instruction copies the content of the accumulator to the destination string. It requires one argument of the size of byte, word, doubleword or quadword. The stosb instruction copies a byte from the AL to the destination string. The stosw instruction copies a word from the AX to the destination string. The stosd instruction copies a doubleword from the EAX to the destination string. The stosq instruction copies a quadword from the RAX to the destination string.

Load string

These instructions load the content of the source string to the accumulator. The lods instruction copies the content of the source string to the accumulator. It requires one argument of the size of byte, word, doubleword or quadword. The lodsb instruction copies a byte from the source string to the AL. The lodsw instruction copies a word from the source string to the AX. The lodsd instruction copies a doubleword from the source string to the EAX. The lodsq instruction copies a quadword from the source string to the RAX.

String compare

Strings can be compared, which means that the element of the destination string is compared with the element of the source string. These instructions set the status flags in the flags register according to the result of the comparison. The elements of both strings remain unchanged. The cmps instruction compares the element of a source string with the element of the destination string. It requires two arguments, which specify the size of the data element. The cmpsb instruction compares a byte from the source string with a byte from the destination string. The cmpsw instruction compares a word from the source string with a word from the destination string. The cmpsd instruction compares a doubleword from the source string with a doubleword from the destination string. The cmpsq instruction compares a quadword from the source string with a quadword from the destination string.

String scan

Strings can be scanned, which means that the element of the destination string is compared with the accumulator. These instructions set the status flags in the flags register according to the result of the comparison. The accumulator and string element remain unchanged. The scas instruction compares the accumulator with the element of the destination string. It requires one argument, which specifies the size of the accumulator and the data element. The scasb instruction compares the AL with a byte from the destination string. The scasw instruction compares the AX with a word from the destination string. The scasd instruction compares the EAX with a doubleword from the destination string. The scasq instruction compares the RAX with a quadword from the destination string.

Repeated string instructions

All string instructions can be preceded by the repetition prefix to automate the processing of multiple-element tables. Use of the prefix makes the processor automatically repeat the instruction according to the content of the counter register and modify the source and destination addresses in the index registers according to the size of the element. Index registers can be incremented or decremented depending on the direction flag (DF) state. If DF is “0”, the addresses are incremented; if DF is “1”, the addresses are decremented. When the string element's size is a byte, the addresses are modified by 1. For words, the addresses are modified by 2, for doublewords by 4, and for quadwords by 8. The rep prefix allows block copying, storing and loading of an entire string rather than a single element. The use of repeated string instructions enables copying the entire string from one place in memory to another, or filling up the memory regions with a pattern.

The repe or repz prefixes additionally test the zero flag and repeat the instruction only while it is “1”, so the scan or comparison finishes early at the first element that differs. The repne or repnz prefixes repeat the instruction only while the zero flag is “0”, so the iteration stops at the first element that matches. The conditional prefixes are intended to be used with scas or cmps instructions. The use of repeated string instructions with conditional prefixes enables comparing strings for equality or differences, or finding an element in a string.

To properly use the repeated string instructions, follow these steps:

  1. Set the SI/ESI/RSI with the address of the source string.
  2. Set the DI/EDI/RDI with the address of the destination string.
  3. Clear or set the DF to determine the direction of string processing - from lower to higher or from higher to lower addresses, respectively.
  4. Set the counter register CX/ECX/RCX with the number of elements to process.
  5. Execute the string instruction with repetition prefix and suffix according to the size of the element.
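
The steps above can be sketched as a small C model of two repeated string instructions. This is an illustration under stated assumptions, not the processor's implementation: the function names model_rep_movsb and model_repne_scasb are hypothetical, the pointer and count parameters stand in for RSI, RDI and RCX, and only byte-sized elements are modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of "rep movsb".
   dst/src play the role of RDI/RSI, count plays the role of RCX.
   df = 0 walks towards higher addresses, df = 1 towards lower ones. */
void model_rep_movsb(uint8_t *dst, const uint8_t *src, size_t count, int df)
{
    ptrdiff_t step = df ? -1 : 1;  /* byte elements: addresses change by 1 */
    ptrdiff_t offset = 0;

    while (count != 0) {           /* the counter is tested before each pass */
        dst[offset] = src[offset]; /* movsb: copy one element */
        offset += step;            /* update both index registers */
        count--;                   /* decrement the counter */
    }
}

/* Hypothetical model of "repne scasb" with DF = 0.
   Scans at most count bytes for the value al. Returns what RCX would hold
   afterwards; *zf receives the zero flag of the last comparison and is
   left unchanged when count is 0. */
size_t model_repne_scasb(const uint8_t *dst, size_t count, uint8_t al, int *zf)
{
    while (count != 0) {
        *zf = (*dst == al);        /* scasb: compare AL with the element */
        dst++;
        count--;
        if (*zf)                   /* repne stops as soon as ZF becomes 1 */
            break;
    }
    return count;
}
```

In this model, the index of a found element equals the initial count minus the returned counter minus one, which mirrors the usual way of locating the match from RCX after repne scasb.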

I/O Instructions

These instructions allow the processor to transfer data between the accumulator register and a peripheral device. A peripheral device can be addressed directly or indirectly. Direct addressing uses an 8-bit constant as the peripheral address (called an I/O port in x86), so it can access only the first 256 port addresses. Indirect addressing uses the DX register as the address register, enabling access to the entire I/O address space of 65536 addresses. The in instruction reads data from a port to the accumulator. The out instruction writes the data from the accumulator to the port. The size of the accumulator determines the size of the data to be transferred. It can be AL, AX or EAX. The I/O instructions also have string versions. Instructions to read the port to a string are ins, insb, insw, and insd. Instructions to write a string to a port are outs, outsb, outsw, and outsd. In all string I/O instructions, the port is addressed with the DX register. Rules for addressing the memory are the same as in string instructions.

Enter and Leave Instructions

The enter instruction creates the stack frame for the function. The stack frame is a place on the stack reserved for the function to store arguments and local variables. Traditionally, we access the stack frame with the use of the RBP register, but we need to preserve its content before use. The enter instruction can be nested or non-nested. The non-nested version saves the RBP on the stack, copies the stack pointer value to RBP, and decreases the stack pointer by the constant value given as the first operand of the instruction, which reserves space for local variables. After these steps, the RSP points to the top of the stack frame, and the RBP points to the frame base. The nested version, selected with a non-zero second operand, additionally creates the path to the higher-level functions' stack frames by pushing copies of their RBP values. The leave instruction reverses what enter did: it copies RBP back to RSP and pops the saved RBP. The enter instruction should be placed at the very beginning of the function, and the leave instruction just before ret.

According to the available information on compiler behaviour, compilers practically never emit the enter instruction, while the leave instruction is used, though rarely.
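
The effect of the non-nested enter and of leave can be followed on a simulated stack. The sketch below is a simplified C model, not real stack manipulation: the Machine structure and the function names are hypothetical, and the stack is an array of 8-byte slots, so sizes are counted in slots rather than bytes.

```c
#include <stdint.h>

#define STACK_SLOTS 64

/* Hypothetical machine state: a small stack and two registers.
   rsp and rbp hold slot indices instead of byte addresses. */
typedef struct {
    uint64_t mem[STACK_SLOTS];
    uint64_t rsp;
    uint64_t rbp;
} Machine;

/* Model of the non-nested "enter locals, 0". */
void model_enter(Machine *m, uint64_t local_slots)
{
    m->rsp -= 1;                 /* push rbp: save the caller's frame base */
    m->mem[m->rsp] = m->rbp;
    m->rbp = m->rsp;             /* mov rbp, rsp: new frame base */
    m->rsp -= local_slots;       /* sub rsp, locals: reserve local storage */
}

/* Model of "leave". */
void model_leave(Machine *m)
{
    m->rsp = m->rbp;             /* mov rsp, rbp: drop local storage */
    m->rbp = m->mem[m->rsp];     /* pop rbp: restore the caller's frame base */
    m->rsp += 1;
}
```

After model_enter, the saved frame base sits exactly at the slot RBP points to, which is why leave needs no operand to undo the frame.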

Flag Control Instructions

Flag control instructions are typically used to set or clear the chosen flag in the RFLAGS register. We can only control three flags directly. The carry (CF) flag can be used in conjunction with the rotate-with-carry instructions to convert a series of bits into a binary-encoded value. The direction (DF) flag determines the direction of modification of index registers RSI and RDI when executing string instructions. If the DF flag is clear, the index registers are incremented; if the DF flag is set, the registers are decremented after each iteration of a string instruction. The interrupt (IF) flag enables or disables hardware interrupts. If the IF flag is set, the hardware interrupts are enabled; if the IF flag is clear, hardware interrupts are masked. The summary of instructions is shown in table 4.

Table 4: Flags manipulating instructions
Instruction   Behaviour                Flag affected
stc           set carry flag           CF=1
clc           clear carry flag         CF=0
cmc           complement carry flag    CF=not CF
std           set direction flag       DF=1
cld           clear direction flag     DF=0
sti           set interrupt flag       IF=1
cli           clear interrupt flag     IF=0

The flags register can be pushed onto the stack and popped afterwards. This is useful for preserving the flags inside a procedure, and also for testing or manipulating those bits of the flags register that have no dedicated instruction. The pushf pushes the FLAGS register, the pushfd pushes the EFLAGS register, and the pushfq pushes the RFLAGS register onto the stack. The popf pops the FLAGS register, the popfd pops the EFLAGS register, and the popfq pops the RFLAGS register from the stack. It is also possible to copy SF, ZF, AF, PF, and CF to the AH register with the lahf instruction, and to store these flags back from AH with the sahf instruction.
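
The five flags handled by lahf and sahf occupy fixed positions in the lowest byte of the flags register: CF is bit 0, PF bit 2, AF bit 4, ZF bit 6 and SF bit 7, while bit 1 always reads as 1. A minimal C sketch of this layout follows; the function and macro names are hypothetical and the flags register is passed as a plain integer.

```c
#include <stdint.h>

/* Bits of the low flags byte copied by lahf/sahf: SF, ZF, AF, PF, CF. */
#define LAHF_MASK 0xD5u

/* Model of "lahf": returns the value loaded into AH. */
uint8_t model_lahf(uint64_t rflags)
{
    /* Bit 1 of the flags register is reserved and always reads as 1. */
    return (uint8_t)((rflags & LAHF_MASK) | 0x02u);
}

/* Model of "sahf": returns the new flags value. Only the five status
   flags are replaced; all other bits (for example IF or DF) are kept. */
uint64_t model_sahf(uint64_t rflags, uint8_t ah)
{
    return (rflags & ~(uint64_t)LAHF_MASK) | (ah & LAHF_MASK);
}
```

For example, with only ZF and CF set, the value placed in AH is 0x43.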

Segment Register Instructions

Segment register instructions are used to load a far pointer into a pair of registers. One register of the pair is the segment register, which is determined by the instruction; the other holds the offset and appears as the destination argument. The source argument is the far pointer stored in the memory. These instructions include lds – load far pointer using DS, les – load far pointer using ES, lfs – load far pointer using FS, lgs – load far pointer using GS, and lss – load far pointer using SS. The following example shows loading a far pointer in 16-bit mode.

; Load far pointer to DS:BX
; Variable Far_point holds the 32-bit address
 
lds  BX,Far_point
 
; Instruction above is equal to:
 
mov  AX,WORD PTR Far_point+2 ; Take higher word of far pointer
mov  DS,AX                   ; Store it in DS
mov  BX,WORD PTR Far_point   ; Store lower word of far pointer in BX

In 64-bit mode, lds and les instructions are not supported.

Miscellaneous instructions

No operation

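The nop instruction performs no operation. The only result is incrementation of the instruction pointer. Historically, it is an alias of the instruction xchg eax, eax, which shares the same opcode. In 64-bit mode, the opcode 0x90 is always treated as nop, so it does not clear the upper half of RAX as a real 32-bit xchg would.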

nop             ;encoded as 0x90
xchg eax, eax   ;encoded as 0x90 in 32-bit mode

Load effective address

The lea instruction calculates the effective address as the result of the proper address expression and stores the result in a destination operand. We can store the effective address in a single register to avoid complex address calculation inside a loop, like in the following example.

; Load effective address to BX
; Table is the beginning of the table of words in the memory
 
  lea   BX,Table[SI]
 
; Now we can use BX only to make the program run faster:
hoop:
  mov   AX,[BX] ; Take value from table
  add   BX,2    ; Next element in the table (elements are 2-byte words)
  cmp   AX,0    ; Check if element is 0
  jne   hoop    ; Jump to "hoop" if AX isn't 0

Because the lea instruction adds the components of the address expression without accessing memory or modifying flags, it is sometimes used instead of the add instruction.
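
The arithmetic performed by lea can be written as a one-line C model. This is a sketch only: the function name model_lea is hypothetical, and the parameters stand for the base register, index register, scale factor (1, 2, 4 or 8) and displacement of the address expression.

```c
#include <stdint.h>

/* Hypothetical model of "lea dst, [base + index*scale + disp]".
   Only the address is computed: no memory is read and no flags change.
   The arithmetic wraps around modulo 2^64, as register arithmetic does. */
uint64_t model_lea(uint64_t base, uint64_t index, unsigned scale, int32_t disp)
{
    return base + index * scale + (uint64_t)(int64_t)disp;
}
```

Using the same register as base and index with a scale of 4 multiplies it by 5, which is one common reason for choosing lea over add or a multiplication.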

Undefined instructions

The undefined instructions can be used to test the behaviour of the system software when an unknown opcode appears in the instruction stream. The ud0 and ud1 instructions can have a source operand (register or memory address) and a destination operand (register); the operands are not used. The ud2 instruction does not have an operand. Executing any undefined instruction raises an invalid opcode exception (#UD).

Table lookup

The xlatb instruction copies a byte from a table into the AL register. The byte is addressed as the sum of the BX/EBX/RBX and AL registers. There is also an xlat version, which allows specifying an address in the memory as the argument. It can be somewhat misleading because the argument is never used by the processor. This instruction can be used to implement the conversion from a 4-bit binary value into a hexadecimal digit, as in the following code.

.DATA
conv_table DB "0123456789ABCDEF"
char       DB ?          ; Place for the resulting character
 
.CODE
; Load base address of table to RBX
  lea   RBX, conv_table
  and   AL, 0Fh  ; Limit AL to 4 bits
  xlatb          ; Take element from the table
  mov   char, AL ; Resulting char is in AL
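
The same conversion can be expressed as a short C model, which makes the addressing rule of xlatb explicit. The names model_xlatb and nibble_to_hex are hypothetical; the table pointer stands for RBX and the al variable for the AL register.

```c
#include <stdint.h>

static const char conv_table[] = "0123456789ABCDEF";

/* Hypothetical model of "xlatb": AL is replaced by the byte found at
   address (table base + AL). */
uint8_t model_xlatb(const uint8_t *table, uint8_t al)
{
    return table[al];
}

/* Converts the low 4 bits of value into a hexadecimal digit. */
char nibble_to_hex(uint8_t value)
{
    uint8_t al = value & 0x0F;                          /* and AL, 0Fh */
    al = model_xlatb((const uint8_t *)conv_table, al);  /* xlatb */
    return (char)al;
}
```

Masking AL to 4 bits first matters because the table has only 16 entries; without it, a larger AL would index past the table.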

Processor identification

The cpuid instruction provides processor identification information. It operates similarly to a function call, with the input value passed in the accumulator (EAX). Depending on the EAX value, it returns different information about the processor. The requested information is returned in processor registers. For example, if EAX is zero, it returns the vendor identification string, “GenuineIntel” for Intel processors or “AuthenticAMD” for AMD models, in the EBX, EDX and ECX registers (in that order). It is shown in figure 13.

Figure 13: Illustration of vendor string reading by cpuid instruction
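
The way the vendor string is spread over the registers can be sketched in C. The function below does not execute cpuid; it only assembles the 12 characters from three register values passed as arguments, so it runs on any machine. The name cpuid_vendor_string is hypothetical.

```c
#include <stdint.h>

/* Builds the vendor string from the register values returned by cpuid
   with EAX = 0. The characters are stored in the order EBX, EDX, ECX,
   and within each register the lowest byte comes first.
   out must have room for 12 characters and the terminating zero. */
void cpuid_vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };

    for (int r = 0; r < 3; r++) {
        for (int b = 0; b < 4; b++) {
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xFF);
        }
    }
    out[12] = '\0';
}
```

For an Intel processor the registers hold EBX = 0x756E6547 ("Genu"), EDX = 0x49656E69 ("ineI") and ECX = 0x6C65746E ("ntel").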

MOVBE instruction

The movbe instruction moves data after swapping data bytes. It operates on words, doublewords or quadwords and is usually used to change the endianness of the data.
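
The byte swap performed by movbe on a doubleword can be modelled in portable C as follows. This is a sketch of the data transformation only; the name model_movbe32 is hypothetical, and the memory access of the real instruction is not modelled.

```c
#include <stdint.h>

/* Hypothetical model of the 32-bit movbe: the four bytes of the value
   are stored in reverse order. */
uint32_t model_movbe32(uint32_t value)
{
    return ((value & 0x000000FFu) << 24) |
           ((value & 0x0000FF00u) << 8)  |
           ((value & 0x00FF0000u) >> 8)  |
           ((value & 0xFF000000u) >> 24);
}
```

Applying the swap twice returns the original value, so the same operation serves for both conversion directions between little-endian and big-endian data.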

Cache manipulating instructions

Cache memory is managed by the processor, and its decisions usually keep the performance of software execution at a good level. However, the processor also offers instructions that allow the programmer to send hints to the cache management mechanism: to prefetch data in advance of using it (prefetchw, prefetchwt1), or to synchronise the cache with memory and flush a cache line to make it available for other data (clflush, clflushopt). Additional cache management instructions were introduced together with the multimedia and vector extensions.

User Mode Extended State Save/Restore Instructions

Some instructions allow for saving and restoring the state of several units of the processor. They are intended to help with fast context switching between processes and to be used instead of saving each register separately at the beginning of a subroutine and restoring it at the end. The content of registers is stored in the memory area given as the operand of the instruction, while the EDX:EAX registers hold a bit mask that selects which state components are processed. Instructions for saving the state are xsave, xsavec, and xsaveopt. The instruction for restoring the state is xrstor. The related xgetbv instruction reads an extended control register, which reports the state components enabled by the operating system.

Random Number Generator Instructions

In the x64 architecture, there are two instructions for generating a random number. These are rdseed and rdrand. A random number is generated by a specially designed hardware unit. The difference between instructions is that rdseed gets random bits generated from entropy gathered from a sensor on the chip. It is slower but offers better randomness of the number. The rdrand gets bits from a pseudorandom number generator. It is faster, offering output that is sufficiently secure for most cryptographic applications.

BMI1 and BMI2 Instructions

The abbreviation BMI comes from Bit Manipulation Instructions. These instructions are designed for some specific manipulation of bits in the arguments, enabling programmers to use a single instruction instead of a few. The andn instruction extends the group of logical instructions. It performs a bitwise AND of the inverted first source operand with the second source operand. There are additional shift and rotate instructions that do not affect flags, which allows for more predictable execution without dependency on flag changes from previous operations. These instructions are rorx - rotate right, sarx - shift arithmetic right, shlx - shift logical left, and shrx - shift logical right. Also, unsigned multiplication without affecting flags, mulx, was introduced. Other instructions manipulate bits, as the group name suggests.
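
The results of andn and rorx can be modelled with plain C operators. The sketch below covers 32-bit operands only, and the function names are hypothetical. Following the definition in the processor documentation, andn inverts its first source operand.

```c
#include <stdint.h>

/* Hypothetical model of "andn dst, src1, src2": dst = (NOT src1) AND src2. */
uint32_t model_andn32(uint32_t src1, uint32_t src2)
{
    return ~src1 & src2;
}

/* Hypothetical model of "rorx dst, src, count": rotate right without
   touching any flag. For 32-bit operands only the low 5 bits of the
   count are used. */
uint32_t model_rorx32(uint32_t src, unsigned count)
{
    count &= 31;
    if (count == 0)
        return src;            /* avoid an undefined shift by 32 in C */
    return (src >> count) | (src << (32 - count));
}
```

A typical use of andn is clearing in one value all bits that are set in a mask, without first computing the inverted mask in a separate register.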

The lzcnt instruction counts the number of zeros in an argument starting from the most significant bit. The tzcnt counts zeros starting from the least significant bit. For an argument that is not zero, lzcnt returns the number of zeros before the first 1 from the left, and tzcnt gives the number of zeros before the first 1 from the right. For an argument equal to zero, both instructions return the operand size in bits. The bextr instruction copies a contiguous field of bits from the source to the destination argument, placing it at the least significant bits of the destination. The third argument specifies both the starting bit position and the number of bits: bits 7:0 of the third operand specify the starting bit position, while bits 15:8 specify the maximum number of bits to extract, as shown in figure 14.

Figure 14: Illustration of bit extraction instruction
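
The behaviour of lzcnt, tzcnt and bextr described above can be checked with a portable C model for 32-bit operands. The function names are hypothetical, and the loops are written for clarity rather than speed.

```c
#include <stdint.h>

/* Hypothetical model of lzcnt: zeros counted from the most significant bit.
   Returns 32 for an argument equal to zero. */
unsigned model_lzcnt32(uint32_t x)
{
    unsigned n = 0;
    for (uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1)
        n++;
    return n;
}

/* Hypothetical model of tzcnt: zeros counted from the least significant bit.
   Returns 32 for an argument equal to zero. */
unsigned model_tzcnt32(uint32_t x)
{
    unsigned n = 0;
    for (uint32_t bit = 1u; bit != 0 && (x & bit) == 0; bit <<= 1)
        n++;
    return n;
}

/* Hypothetical model of bextr: bits 7:0 of control give the start
   position, bits 15:8 give the field length. */
uint32_t model_bextr32(uint32_t src, uint32_t control)
{
    unsigned start = control & 0xFF;
    unsigned len = (control >> 8) & 0xFF;

    if (start >= 32 || len == 0)
        return 0;
    uint32_t field = src >> start;        /* move the field to bit 0 */
    if (len < 32)
        field &= (1u << len) - 1u;        /* keep only len bits */
    return field;
}
```

For example, a control value of 0x0804 requests 8 bits starting at bit 4.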

The blsi instruction extracts the single, lowest bit set to one, as shown in figure 15.

Figure 15: Illustration of lowest set bit extraction instruction

The blsmsk instruction creates a mask in which all bits from bit 0 up to and including the lowest bit set to 1 in the source are set, and all higher bits are cleared. It is shown in figure 16.

Figure 16: Illustration of the instruction which sets all lower bits below a first bit set to 1

The blsr instruction resets (clears the bit to zero value) the lowest set bit. It is shown in figure 17.

Figure 17: Illustration of the instruction which resets a first bit set to 1
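
The three instructions operating on the lowest set bit (blsi, blsmsk and blsr) each correspond to a classic one-line bit trick. A minimal C sketch for 32-bit operands is given below; the function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of blsi: isolate the lowest set bit, x AND (-x). */
uint32_t model_blsi32(uint32_t x)
{
    return x & (0u - x);
}

/* Hypothetical model of blsmsk: mask up to and including the lowest
   set bit, x XOR (x - 1). */
uint32_t model_blsmsk32(uint32_t x)
{
    return x ^ (x - 1u);
}

/* Hypothetical model of blsr: clear the lowest set bit, x AND (x - 1). */
uint32_t model_blsr32(uint32_t x)
{
    return x & (x - 1u);
}
```

For the value 0x68 (binary 0110 1000) the three functions return 0x08, 0x0F and 0x60, respectively.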

The bzhi instruction resets high bits starting from the specified bit position, as shown in figure 18.

Figure 18: Illustration of the instruction which resets high bits starting from the specified bit position
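
A C model of bzhi for 32-bit operands can look as follows. It is a sketch with a hypothetical function name; the bit position is taken from the low 8 bits of the index argument, and a position equal to or greater than the operand size leaves the source unchanged.

```c
#include <stdint.h>

/* Hypothetical model of "bzhi dst, src, index": clear all bits of src
   from bit position index upwards. */
uint32_t model_bzhi32(uint32_t src, uint32_t index)
{
    unsigned n = index & 0xFF;       /* only bits 7:0 of index are used */

    if (n >= 32)
        return src;                  /* nothing to clear */
    return src & ((1u << n) - 1u);   /* keep the n lowest bits */
}
```

This gives a convenient way to keep the n lowest bits of a value when n is known only at run time.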

The pdep instruction performs a parallel deposit of bits using a mask. Its behaviour is shown in figure 19.

Figure 19: Illustration of the parallel deposit instruction

The pext instruction performs a parallel extraction of bits using a mask. Its behaviour is shown in figure 20.

Figure 20: Illustration of the parallel extraction instruction
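
Both pdep and pext can be described with a simple loop over the mask bits. The C sketch below models the 32-bit versions; the function names are hypothetical, and the bit-by-bit loops illustrate the definition rather than how the hardware computes the result.

```c
#include <stdint.h>

/* Hypothetical model of pdep: successive low bits of src are deposited
   at the positions where mask has a 1; other result bits are 0. */
uint32_t model_pdep32(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    uint32_t src_bit = 1u;               /* next source bit to deposit */

    for (uint32_t pos = 1u; pos != 0; pos <<= 1) {
        if (mask & pos) {
            if (src & src_bit)
                result |= pos;
            src_bit <<= 1;
        }
    }
    return result;
}

/* Hypothetical model of pext: bits of src at the positions where mask
   has a 1 are packed together into the low bits of the result. */
uint32_t model_pext32(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    uint32_t dst_bit = 1u;               /* next result bit to fill */

    for (uint32_t pos = 1u; pos != 0; pos <<= 1) {
        if (mask & pos) {
            if (src & pos)
                result |= dst_bit;
            dst_bit <<= 1;
        }
    }
    return result;
}
```

With the same mask, the two operations are complementary: extracting with pext and depositing the result with pdep restores the masked bits of the original value.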