====== FPU ======
The Floating Point Unit (FPU) was developed to speed up calculations on real numbers, which a computer encodes as floating-point numbers. Early in the history of x86 processors, the FPU was a separate integrated circuit. Since the i486DX, it has been a standard element of every processor. The FPU operates on single, double or extended precision values, using its own set of registers and instructions. For details about FPU registers, please refer to the section "Register Set".
<note>
Although modern extensions for processors can currently perform calculations with real numbers faster with the use of vector instructions, they do not achieve the precision available with the extended precision encoding.
</note>
Registers in the FPU are organised as a stack with 8 levels of depth. Physical registers are named R0 - R7, while the registers visible to FPU instructions take the names ST(0) to ST(7). The ST(0), also referred to as ST, is always the top of the stack, while ST(7) is the bottom of the stack. The top of the stack is indicated by a three-bit field in the FPU status word register. Each time data is loaded into the FPU, the stack top is decremented, and each time data is popped off the stack, it is incremented. The initial state is as shown in figure {{ref>fpuinit}}.

<figure fpuinit>
{{ :en:multiasm:cs:fpu_initial.png?200 |Illustration of the initial state of FPU registers and stack}}
<caption>The initial state of FPU registers and stack</caption>
</figure>

When data is loaded into the FPU, it is pushed onto the stack, as shown in figure {{ref>fpuload}}.

<figure fpuload>
{{ :en:multiasm:cs:fpu_load.png?200 |Illustration of pushing data onto the FPU stack}}
<caption>The FPU stack after pushing single data</caption>
</figure>
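
The relation between the physical registers and the ST(i) names can be illustrated with a small Python model. This is only a sketch for illustration, not real FPU code: the class ''FpuStack'' and its methods are hypothetical names, and empty registers are simply marked with ''None'' instead of using the tag word.

```python
class FpuStack:
    """Toy model of the x87 register stack: 8 physical registers and a 3-bit TOP field."""

    def __init__(self):
        self.regs = [None] * 8   # physical registers R0..R7, None marks an empty register
        self.top = 0             # 3-bit TOP field, 0 after initialisation

    def st(self, i):
        """Return the value of ST(i), which is physical register R[(TOP + i) mod 8]."""
        return self.regs[(self.top + i) % 8]

    def push(self, value):
        """Load: decrement TOP first (wrapping modulo 8), then write the new ST(0)."""
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value

    def pop(self):
        """Pop: read ST(0), mark the register empty, then increment TOP."""
        value = self.regs[self.top]
        self.regs[self.top] = None
        self.top = (self.top + 1) % 8
        return value


fpu = FpuStack()
fpu.push(1.5)                          # goes to R7, TOP becomes 7
fpu.push(2.5)                          # goes to R6, TOP becomes 6
print(fpu.top, fpu.st(0), fpu.st(1))   # prints: 6 2.5 1.5
```

The first value loaded after initialisation lands in R7, because TOP wraps from 0 to 7 before the write.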

The stack organisation of registers makes it easier to implement math calculations according to the RPN (Reverse Polish Notation), also called postfix notation.
Further in this section, we'll present the FPU coprocessor's instructions. They can be grouped as:
  * data transfer instructions, 
  * load constants instructions, 
  * basic arithmetic instructions, 
  * comparison instructions, 
  * transcendental instructions, 
  * FPU control instructions.

===== Data transfer instructions =====

Data in the memory used by FPU can be stored as a single precision, double precision or double extended-precision floating point value or as an integer of the type word, double word, quadword, or 18-digit BCD number. The FPU always converts it into double-extended precision floating point while loading it into an internal register and converts the data to the required format while storing it back into memory.
Load instructions always decrement the stack top field in the status word register first and then load the value onto the new top of the register stack.
The **fld** instruction loads a single precision, double precision or double extended-precision value onto the FPU register stack. It can also copy data from another FPU register into ST. The **fild** instruction loads integer values of the type word, doubleword or quadword. The **fbld** instruction loads an 18-digit binary coded decimal (BCD) value.
Store instructions take data off the FPU register stack and place it in the memory.
The **fst** instruction stores a single precision or a double precision value to the memory. It can also copy a value to another FPU register. The **fstp** instruction also works with 80-bit double extended precision values and additionally pops data off the stack by incrementing the stack top field in the status word register. It can also copy a value to another FPU register, popping it off the FPU register stack.
The **fist** instruction converts the value from the top of the FPU register stack into a word or doubleword integer and stores it in memory. The **fistp** instruction can also store a 64-bit quadword integer and additionally pops the value from the stack top.
The **fbstp** instruction pops a value off the FPU register stack and writes it as an 80-bit BCD encoded integer in memory.
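
The 80-bit packed BCD layout used by **fbld** and **fbstp** can be sketched in Python as follows. The function name ''pack_bcd'' is hypothetical, and the sketch assumes the input is already an integer; the real **fbstp** first rounds ST(0) to an integer according to the current rounding mode.

```python
def pack_bcd(value):
    """Encode an integer as a 10-byte x87 packed BCD value (least significant byte first)."""
    if abs(value) >= 10**18:
        raise ValueError("packed BCD holds at most 18 decimal digits")
    digits = f"{abs(value):018d}"            # 18 decimal digits, most significant first
    out = bytearray(10)
    for i in range(9):
        low = int(digits[17 - 2 * i])        # lower digit of the pair goes to the low nibble
        high = int(digits[16 - 2 * i])       # higher digit of the pair goes to the high nibble
        out[i] = (high << 4) | low
    out[9] = 0x80 if value < 0 else 0x00     # sign is kept in bit 7 of the last byte
    return bytes(out)


print(pack_bcd(-1234).hex(" "))
```

Each of the first nine bytes holds two decimal digits, so the number -1234 starts with the bytes 0x34 and 0x12 and ends with the sign byte 0x80.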

<note>
Please note that it is not possible to exchange values between the FPU stack and CPU registers directly. It is also not possible to load a constant encoded as an immediate value. Both can be done with the use of temporary variables placed in memory.
</note>

The **fxch** instruction exchanges values in two FPU registers, where one of them is the top of the stack. This instruction, used without any argument, exchanges ST and ST(1).
The **fcmov//cc//** instructions provide conditional data transfer. They were introduced in the P6 processor to avoid conditional jumps. They test a condition based on flags in the EFLAGS register. The source operand can be any FPU register, while the destination is always ST(0). The **fcmov//cc//** instructions do not modify the stack top in the FPU.
There are eight such instructions, summarised in table {{ref>fpufcmovcc}}.
<table fpufcmovcc>
<caption>Variations of **fcmov//cc//** instruction</caption>
^ Mnemonic ^ flags checked ^ description ^
| **fcmove** | ZF=1 | equal |
| **fcmovne** | ZF=0 | not equal |
| **fcmovb**  | CF=1 | below |
| **fcmovbe** | CF=1 or ZF=1 | below or equal |
| **fcmovnb** | CF=0 | not below |
| **fcmovnbe** | CF=0 and ZF=0 | not below or equal |
| **fcmovu** | PF=1 | unordered |
| **fcmovnu** | PF=0 | not unordered |
</table>
<note>
Unordered means that at least one of the arguments of the comparison instruction does not represent a proper numerical value (is NaN).
</note>
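
The conditions from the table above can be written down as a small Python model. It is a sketch only: the names ''FCMOV_CONDITIONS'' and ''fcmov'' are hypothetical, and the flags are passed in as plain integers instead of being read from EFLAGS.

```python
# Condition tested by each fcmovcc variant, as a function of CF, ZF and PF.
FCMOV_CONDITIONS = {
    "fcmove":   lambda cf, zf, pf: zf == 1,
    "fcmovne":  lambda cf, zf, pf: zf == 0,
    "fcmovb":   lambda cf, zf, pf: cf == 1,
    "fcmovbe":  lambda cf, zf, pf: cf == 1 or zf == 1,
    "fcmovnb":  lambda cf, zf, pf: cf == 0,
    "fcmovnbe": lambda cf, zf, pf: cf == 0 and zf == 0,
    "fcmovu":   lambda cf, zf, pf: pf == 1,
    "fcmovnu":  lambda cf, zf, pf: pf == 0,
}


def fcmov(mnemonic, st0, sti, cf, zf, pf):
    """Return the new ST(0): a copy of ST(i) if the condition holds, otherwise ST(0) unchanged."""
    return sti if FCMOV_CONDITIONS[mnemonic](cf, zf, pf) else st0


# CF=1 means "below", so fcmovb copies ST(i) into ST(0).
print(fcmov("fcmovb", 1.0, 2.0, cf=1, zf=0, pf=0))    # prints: 2.0
# The same flags do not satisfy fcmovnbe, so ST(0) keeps its value.
print(fcmov("fcmovnbe", 1.0, 2.0, cf=1, zf=0, pf=0))  # prints: 1.0
```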

===== Load constants instructions =====
Some constant values can be pushed onto the FPU register stack without the need to define them in memory. Loading them this way is faster than using instructions that access memory. These instructions are summarised in table {{ref>fload_constants}}.
<table fload_constants>
<caption>Load constants instructions</caption>
^ Mnemonic ^ value loaded into ST ^
| **fldz** |  0  |
| **fld1** |  1  |
| **fldpi**  |{{ :en:multiasm:cs:fldpi.png?105 }}|
| **fldl2e** |{{ :en:multiasm:cs:fldl2e.png?105 }}|
| **fldl2t** |{{ :en:multiasm:cs:fldl2t.png?105 }}|
| **fldlg2** |{{ :en:multiasm:cs:fldlg2.png?105 }}|
| **fldln2** |{{ :en:multiasm:cs:fldln2.png?105 }}|
</table>
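
For orientation, the numerical values of these constants can be computed in Python. Keep in mind that Python floats are double precision, so the printed values carry fewer significant digits than the extended precision constants held by the FPU. The dictionary name ''FPU_CONSTANTS'' is hypothetical.

```python
import math

# Value pushed onto the stack by each load-constant instruction (double precision approximation).
FPU_CONSTANTS = {
    "fldz":   0.0,
    "fld1":   1.0,
    "fldpi":  math.pi,             # pi
    "fldl2e": math.log2(math.e),   # base-2 logarithm of e
    "fldl2t": math.log2(10.0),     # base-2 logarithm of 10
    "fldlg2": math.log10(2.0),     # base-10 logarithm of 2
    "fldln2": math.log(2.0),       # natural logarithm of 2
}

for mnemonic, value in FPU_CONSTANTS.items():
    print(f"{mnemonic:7s} {value!r}")
```

The logarithmic constants are chosen so that logarithms and powers in other bases can be derived from the base-2 instructions of the FPU.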

===== Basic arithmetic instructions =====
This group of instructions contains addition, subtraction, multiplication and division instructions in various versions. Arguments of these instructions determine their behaviour. Let's consider some examples.
If the instruction has a single argument, it must be a memory argument that specifies a single precision or a double precision floating point number; the result is always stored in ST(0). The version with two arguments works with registers only, and the order of the arguments determines both the order of the calculation and where the result is placed. For example, **fsub ST(0), ST(i)** subtracts ST(i) from ST(0) and stores the result in ST(0), while **fsub ST(i), ST(0)** subtracts ST(0) from ST(i) and stores the result in ST(i). The popping version with two arguments additionally pops the stack. For example, **fsubp ST(i), ST(0)** subtracts ST(0) from ST(i), stores the result in ST(i) and pops the stack. The version without arguments implies ST(1) as the destination and ST(0) as the source. For example, **fsubp** subtracts ST(0) from ST(1), stores the result in ST(1) and pops the stack, so the result ends up at the stack top. Basic arithmetic instructions are summarised in table {{ref>ffparithmetic}}, where //float// represents a single precision argument in memory, //double// represents a double precision argument in memory, and ST(i) is the i-th FPU register.
<table ffparithmetic>
<caption>Basic floating point arithmetic instructions</caption>
^ Mnemonic ^ operation ^ result ^ pop ^
|ADDITION| | | |
| **fadd float** |  ST(0) + float  |  ST(0)  |  no  |
| **fadd double** |  ST(0) + double  |  ST(0)  |  no  |
| **fadd ST(0), ST(i)**  |  ST(0) + ST(i)  |  ST(0)  |  no  |
| **fadd ST(i), ST(0)** |  ST(i) + ST(0)  |  ST(i)  |  no  |
| **faddp ST(i), ST(0)** |  ST(i) + ST(0)  |  ST(i)  |  yes  |
| **faddp** |  ST(1) + ST(0)  |  ST(1)  |  yes  |
|SUBTRACTION| | | |
| **fsub float** |  ST(0) - float  |  ST(0)  |  no  |
| **fsub double** |  ST(0) - double  |  ST(0)  |  no  |
| **fsub ST(0), ST(i)**  |  ST(0) - ST(i)  |  ST(0)  |  no  |
| **fsub ST(i), ST(0)** |  ST(i) - ST(0)  |  ST(i)  |  no  |
| **fsubp ST(i), ST(0)** |  ST(i) - ST(0)  |  ST(i)  |  yes  |
| **fsubp** |  ST(1) - ST(0)  |  ST(1)  |  yes  |
|MULTIPLICATION| | | |
| **fmul float** |  ST(0) * float  |  ST(0)  |  no  |
| **fmul double** |  ST(0) * double  |  ST(0)  |  no  |
| **fmul ST(0), ST(i)**  |  ST(0) * ST(i)  |  ST(0)  |  no  |
| **fmul ST(i), ST(0)** |  ST(i) * ST(0)  |  ST(i)  |  no  |
| **fmulp ST(i), ST(0)** |  ST(i) * ST(0)  |  ST(i)  |  yes  |
| **fmulp** |  ST(1) * ST(0)  |  ST(1)  |  yes  |
|DIVISION| | | |
| **fdiv float** |  ST(0) / float  |  ST(0)  |  no  |
| **fdiv double** |  ST(0) / double  |  ST(0)  |  no  |
| **fdiv ST(0), ST(i)**  |  ST(0) / ST(i)  |  ST(0)  |  no  |
| **fdiv ST(i), ST(0)** |  ST(i) / ST(0)  |  ST(i)  |  no  |
| **fdivp ST(i), ST(0)** |  ST(i) / ST(0)  |  ST(i)  |  yes  |
| **fdivp** |  ST(1) / ST(0)  |  ST(1)  |  yes  |
</table>
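
The popping versions without arguments map directly onto postfix (RPN) evaluation. The following Python sketch models the register stack as a list in which index 0 plays the role of ST(0); the helper functions are hypothetical stand-ins named after the instructions they imitate. It evaluates (3 + 4) * (10 - 4), which in postfix notation is 3 4 + 10 4 - *.

```python
def fld(stack, value):
    """Push a value; it becomes the new ST(0)."""
    stack.insert(0, value)


def faddp(stack):
    """ST(1) = ST(1) + ST(0), then pop."""
    stack[1] = stack[1] + stack[0]
    stack.pop(0)


def fsubp(stack):
    """ST(1) = ST(1) - ST(0), then pop."""
    stack[1] = stack[1] - stack[0]
    stack.pop(0)


def fmulp(stack):
    """ST(1) = ST(1) * ST(0), then pop."""
    stack[1] = stack[1] * stack[0]
    stack.pop(0)


stack = []
fld(stack, 3.0)
fld(stack, 4.0)
faddp(stack)        # stack: [7.0]
fld(stack, 10.0)
fld(stack, 4.0)
fsubp(stack)        # stack: [6.0, 7.0]
fmulp(stack)        # stack: [42.0]
print(stack)        # prints: [42.0]
```

Each operator consumes the two topmost values and leaves a single result at the stack top, exactly as an RPN calculator does.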
The addition and multiplication operations are commutative, while subtraction and division are not. That's why reversed versions of subtraction and division are implemented. The difference is the order of the operands, while the destination remains the same as in the non-reversed versions.

<table frevarithmetic>
<caption>Reversed floating point arithmetic instructions</caption>
^ Mnemonic ^ operation ^ result ^ pop ^
|REVERSED SUBTRACTION| | | |
| **fsubr float** |  float - ST(0)  |  ST(0)  |  no  |
| **fsubr double** |  double - ST(0)  |  ST(0)  |  no  |
| **fsubr ST(0), ST(i)**  |  ST(i) - ST(0)  |  ST(0)  |  no  |
| **fsubr ST(i), ST(0)** |  ST(0) - ST(i)  |  ST(i)  |  no  |
| **fsubrp ST(i), ST(0)** |  ST(0) - ST(i)  |  ST(i)  |  yes  |
| **fsubrp** |  ST(0) - ST(1)  |  ST(1)  |  yes  |
|REVERSED DIVISION| | | |
| **fdivr float** |  float / ST(0)  |  ST(0)  |  no  |
| **fdivr double** |  double / ST(0)  |  ST(0)  |  no  |
| **fdivr ST(0), ST(i)**  |  ST(i) / ST(0)  |  ST(0)  |  no  |
| **fdivr ST(i), ST(0)** |  ST(0) / ST(i)  |  ST(i)  |  no  |
| **fdivrp ST(i), ST(0)** |  ST(0) / ST(i)  |  ST(i)  |  yes  |
| **fdivrp** |  ST(0) / ST(1)  |  ST(1)  |  yes  |
</table>
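
The difference between the normal and reversed versions can be checked with a short Python model. The names ''FSUB'', ''FSUBR'', ''FDIV'', ''FDIVR'' and ''apply'' are hypothetical; the list ''regs'' stands for the registers ST(0) and ST(1).

```python
# Each operation receives the destination value d and the source value s.
FSUB  = lambda d, s: d - s    # destination - source
FSUBR = lambda d, s: s - d    # source - destination
FDIV  = lambda d, s: d / s    # destination / source
FDIVR = lambda d, s: s / d    # source / destination


def apply(op, dst, src, regs):
    """Return a copy of regs after executing 'op ST(dst), ST(src)'."""
    result = list(regs)
    result[dst] = op(regs[dst], regs[src])
    return result


regs = [2.0, 8.0]                    # ST(0) = 2.0, ST(1) = 8.0
print(apply(FSUB,  0, 1, regs))      # fsub  ST(0), ST(1) -> [-6.0, 8.0]
print(apply(FSUBR, 0, 1, regs))      # fsubr ST(0), ST(1) -> [6.0, 8.0]
print(apply(FDIV,  1, 0, regs))      # fdiv  ST(1), ST(0) -> [2.0, 4.0]
print(apply(FDIVR, 1, 0, regs))      # fdivr ST(1), ST(0) -> [2.0, 0.25]
```

In every case the result lands in the first (destination) operand; only the roles of minuend and subtrahend, or dividend and divisor, are swapped.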

There are also versions of the four basic arithmetic instructions which operate on an integer memory argument. It can be a word or a doubleword.
<table fintarithmetic>
<caption>Basic integer arithmetic instructions</caption>
^ Mnemonic ^ operation ^ result ^ pop ^
|ADDITION| | | |
| **fiadd word** |  ST(0) + word  |  ST(0)  |  no  |
| **fiadd doubleword** |  ST(0) + doubleword  |  ST(0)  |  no  |
|SUBTRACTION| | | |
| **fisub word** |  ST(0) - word  |  ST(0)  |  no  |
| **fisub doubleword** |  ST(0) - doubleword  |  ST(0)  |  no  |
|REVERSED SUBTRACTION| | | |
| **fisubr word** |  word - ST(0)  |  ST(0)  |  no  |
| **fisubr doubleword** |  doubleword - ST(0)  |  ST(0)  |  no  |
|MULTIPLICATION| | | |
| **fimul word** |  ST(0) * word  |  ST(0)  |  no  |
| **fimul doubleword** |  ST(0) * doubleword  |  ST(0)  |  no  |
|DIVISION| | | |
| **fidiv word** |  ST(0) / word  |  ST(0)  |  no  |
| **fidiv doubleword** |  ST(0) / doubleword  |  ST(0)  |  no  |
|REVERSED DIVISION| | | |
| **fidivr word** |  word / ST(0)  |  ST(0)  |  no  |
| **fidivr doubleword** |  doubleword / ST(0)  |  ST(0)  |  no  |
</table>

The group of basic arithmetic instructions also contains instructions for other calculations. The **fprem** and **fprem1** instructions calculate the partial remainder obtained from dividing the value in the ST(0) register by the value in the ST(1) register. The **fabs** instruction calculates the absolute value of ST(0). The **fchs** instruction changes the sign of ST(0). The **frndint** instruction rounds ST(0) to an integer. The **fscale** instruction scales ST(0) by a power of two taken from ST(1), while **fxtract** separates the value in ST(0) into the exponent, placed in ST(0), and the significand, which is pushed onto the stack. As a result, the exponent is in ST(1) and the significand in ST(0). It is also possible to calculate the square root of ST(0) with the **fsqrt** instruction.
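
The results of some of these instructions can be approximated with functions from the Python ''math'' module. This is a sketch under simplifying assumptions: it works in double precision, handles only finite non-zero arguments, and ignores the "partial" behaviour of **fprem** and **fprem1**, which in hardware may need to be repeated for operands of very different magnitude. The function names merely mirror the mnemonics.

```python
import math


def fprem(st0, st1):
    """Remainder with the quotient truncated toward zero."""
    return math.fmod(st0, st1)


def fprem1(st0, st1):
    """IEEE 754 remainder: the quotient is rounded to the nearest integer."""
    return math.remainder(st0, st1)


def fscale(st0, st1):
    """Multiply ST(0) by 2 raised to ST(1) truncated toward zero."""
    return math.ldexp(st0, math.trunc(st1))


def fxtract(st0):
    """Return (significand, exponent): the new ST(0) and ST(1) after fxtract."""
    mantissa, exponent = math.frexp(st0)     # mantissa magnitude is in [0.5, 1)
    return mantissa * 2.0, float(exponent - 1)


print(fprem(7.0, 2.0), fprem1(7.0, 2.0))   # prints: 1.0 -1.0
print(fscale(3.0, 2.7))                    # prints: 12.0
print(fxtract(10.0))                       # prints: (1.25, 3.0)
```

Note that ''fscale'' applied to the two values returned by ''fxtract'' reconstructs the original number, since 1.25 * 2^3 = 10.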

===== Comparison instructions =====
The comparison instructions compare two floating point values and set flags according to the result. The operand of the **fcom** instruction can be a memory operand or another FPU register. It is always compared with the top of the stack. If no operand is specified, the instruction compares ST(0) and ST(1). The popping version **fcomp** pops ST(0) off the stack. The instruction **fcompp**, with a double "P" at the end, takes no arguments; it compares ST(0) and ST(1) and pops both registers off the stack.
If one of the arguments is NaN, these instructions generate the invalid arithmetic operand exception. To avoid unwanted exceptions, there are unordered versions of the comparison instructions. These are **fucom**, **fucomp**, and **fucompp**. Unordered comparison instructions do not operate with memory arguments. Two instructions are implemented to compare with integers. The **ficom** and **ficomp** instructions have a single memory argument that can be a word or doubleword, which is compared with the top of the stack.
The original instructions set flags C0, C2 and C3 in the FPU status word register. After the FPU became an integral unit of the processor, a new set of instructions appeared that set flags in the EFLAGS register directly. These are **fcomi**, **fcomip**, **fucomi** and **fucomip**. Their first argument is always ST(0), the second is another FPU register.
The **fxam** and **ftst** instructions also belong to the group of comparison instructions. The **fxam** instruction classifies the value of ST(0), while the **ftst** instruction compares ST(0) with the value of 0.0. They return the information in the C0, C2 and C3 flags.
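
The flag settings produced by the comparison instructions can be summarised with a small Python model. The function names are hypothetical. The model covers only the resulting flags and does not simulate exceptions: C3, C2 and C0 are reported as a tuple, and the **fcomi** family places the same three results in ZF, PF and CF.

```python
import math


def fcom_flags(st0, src):
    """Return (C3, C2, C0) as set when ST(0) is compared with src."""
    if math.isnan(st0) or math.isnan(src):
        return (1, 1, 1)      # unordered
    if st0 > src:
        return (0, 0, 0)
    if st0 < src:
        return (0, 0, 1)
    return (1, 0, 0)          # equal


def fcomi_flags(st0, sti):
    """Return (ZF, PF, CF) as set by the fcomi family; same pattern as (C3, C2, C0)."""
    return fcom_flags(st0, sti)


print(fcom_flags(1.0, 2.0))             # prints: (0, 0, 1)
print(fcomi_flags(2.0, 2.0))            # prints: (1, 0, 0)
print(fcom_flags(float("nan"), 1.0))    # prints: (1, 1, 1)
```

Because CF and ZF are set the same way as after an unsigned integer comparison, the results of **fcomi** can be tested with the "below" and "above" conditions, including the **fcmov//cc//** instructions.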

===== Transcendental instructions =====
The transcendental instructions perform calculations of advanced mathematical functions.
The **fsin** instruction calculates the sine, while the **fcos** instruction calculates the cosine of the argument stored in ST(0). The **fsincos** instruction calculates both sine and cosine at once. The sine is returned in ST(1), the cosine in ST(0). The **fptan** instruction calculates the partial tangent and **fpatan** the partial arctangent. After calculating the tangent, the value of 1.0 is pushed onto the stack to make it easier to calculate the cotangent afterwards by executing the **fdivr** instruction. "Partial" means that the instruction handles only a limited range of input arguments.
The instructions for exponential and logarithmic functions are summarised in table {{ref>ftrans}}.
<table ftrans>
<caption>Transcendental arithmetic instructions</caption>
^ Mnemonic     ^ operation                              ^ note on operands        ^
| **f2xm1**    | {{ :en:multiasm:cs:f2xm1.png?105 }}    |                         |
| **fyl2x**    | {{ :en:multiasm:cs:fyl2x.png?105 }}    | y is ST(1); x is ST(0)  |
| **fyl2xp1**  | {{ :en:multiasm:cs:fyl2xp1.png?105 }}  | y is ST(1); x is ST(0)  |
</table>
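
The three instructions from the table can be modelled in Python to show how other logarithms are built from them. The sketch works in double precision; the function names mirror the mnemonics, and the argument order ''(y, x)'' is a choice made here for readability. On the real FPU, **f2xm1** accepts only arguments in the range from -1.0 to +1.0.

```python
import math


def f2xm1(x):
    """2 to the power x, minus 1 (the FPU requires -1.0 <= x <= 1.0)."""
    return 2.0 ** x - 1.0


def fyl2x(y, x):
    """y multiplied by the base-2 logarithm of x."""
    return y * math.log2(x)


def fyl2xp1(y, x):
    """y multiplied by the base-2 logarithm of (x + 1)."""
    return y * math.log2(x + 1.0)


# Natural logarithm: load ln(2) as y (fldln2), load x, then execute fyl2x.
x = 8.0
ln_x = fyl2x(math.log(2.0), x)
print(ln_x, math.log(x))

# Decimal logarithm: load log10(2) as y (fldlg2) instead.
log10_x = fyl2x(math.log10(2.0), 1000.0)
print(log10_x)
```

This is the reason the FPU provides the constants ln(2) and log10(2): multiplying a base-2 logarithm by one of them converts it to the natural or decimal logarithm.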

===== FPU control instructions =====
The FPU control instructions help the programmer to save and restore the contents of chosen registers if there is a need to use them in an interrupt handler or inside a function. It is also possible to initialise the state of the FPU or clear errors.
The **fincstp** increments and **fdecstp** decrements the FPU register stack pointer.
The following instructions exist in two forms: one that checks for pending error conditions before execution (instructions without "N") and one that performs the operation without checking for error conditions (instructions with "N").
The **finit** and **fninit** initialise the FPU (after checking error conditions or without checking error conditions).
The **fclex** and **fnclex** clear floating-point exception flags.
The **fstcw** and **fnstcw** store the FPU control word.
The **fldcw** loads the FPU control word.
The **fstenv** and **fnstenv** store the FPU environment. The environment consists of the FPU control word, status word, tag word, instruction pointer, data pointer, and last opcode register.
The **fldenv** loads the FPU environment.
The **fsave** and **fnsave** save the FPU state. The state is the operating environment and full register stack.
The **frstor** restores the FPU state.
The **fstsw** and **fnstsw** store the FPU status word. There is no instruction for restoring the status word.
The **wait** or **fwait** waits for the FPU to finish the operation.
The **fnop** instruction is the no operation instruction for the FPU.
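
Once the status or control word has been stored with **fnstsw** or **fnstcw**, its fields can be extracted with shifts and masks. The Python sketch below decodes both words according to the x87 bit layout: the condition flags and the three-bit stack top field in the status word, and the precision and rounding control fields in the control word. The function names and the dictionary keys are hypothetical.

```python
def decode_status_word(sw):
    """Split a 16-bit FPU status word into its main fields."""
    return {
        "exceptions": sw & 0x3F,     # bits 0-5: exception flags
        "c0": (sw >> 8) & 1,
        "c1": (sw >> 9) & 1,
        "c2": (sw >> 10) & 1,
        "top": (sw >> 11) & 0x7,     # bits 11-13: stack top pointer
        "c3": (sw >> 14) & 1,
    }


def decode_control_word(cw):
    """Split a 16-bit FPU control word into its main fields."""
    precision = {0: "single", 2: "double", 3: "double extended"}
    rounding = {0: "nearest", 1: "down", 2: "up", 3: "toward zero"}
    return {
        "exception_masks": cw & 0x3F,                             # bits 0-5
        "precision": precision.get((cw >> 8) & 0x3, "reserved"),  # bits 8-9
        "rounding": rounding[(cw >> 10) & 0x3],                   # bits 10-11
    }


# Status word with TOP = 7 and C3 set, e.g. after comparing two equal values.
print(decode_status_word(0x7800))
# 0x037F is the control word value after finit: all exceptions masked,
# double extended precision, rounding to nearest.
print(decode_control_word(0x037F))
```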