An assembler, understood as software that translates assembler source code into machine code, can be implemented in various ways. While the processor's instructions remain constant, other language elements may be implementation-specific. In this chapter, we present the most important language elements specific to the MASM assembly implementation. For a detailed description of MASM assembler language implementation, please refer to the Microsoft© website[1].
The alphabet is the set of characters that can be used to write programs. In MASM, it includes lower-case and upper-case letters, digits, hidden (non-printing) characters and special characters.
Special characters may have a defined meaning, and not all of them can be used freely in the program.
Hidden ASCII characters.
Reserved words, known also as keywords, are words that have special meaning in MASM. They represent elements of the language defined by MASM creators, and their use is reserved for special purposes only. They can’t be used as label names, variable names, constant names and similar user-defined items. They are:
Symbolic names are words defined by the programmer to identify elements of the program. They are used to name constants, variables, addresses, segments and other items of the source code. Certain rules must be followed when creating a symbolic name. A symbolic name can’t begin with a digit and can consist of letters, digits and four special characters: $, @, _, and ?. By default, upper-case and lower-case letters are treated as the same. Older MASM versions recognise only the first 31 characters, while MASM 6.0 and later accept names of up to 247 characters. Examples of proper symbolic names.
ABC123_4 Number_602602602_@1 ?quest $125 __Right_Here
Examples of improper symbolic names.
12_cats ? ‘name’ Hello.world Right Here
Operators are used in expressions calculated during assembly. They enable performing arithmetic and logic calculations with numeric expressions and address expressions. Some operators are associated with macros or other specific elements of assembler language. We'll present some of them. For details about all operators, please refer to the MASM documentation [2].
The operators which can be used in numeric expressions are
The operators which can be used in numeric and address expressions are
The type operators are used in address expressions. They determine the number of bytes in a single variable, the number of elements in a data array, or the size of a whole data array, as summarised in the table below. They are very useful because they automatically recalculate, for example, the number of iterations needed to process a string of characters or a data array when its length changes.
| operator | value |
|---|---|
| TYPE | type (number of bytes in one variable) |
| LENGTH | number of elements created by the first initialiser of a variable |
| SIZE | number of bytes allocated by the first initialiser of a variable |
| LENGTHOF | total number of elements in a variable (data array) |
| SIZEOF | total number of bytes in a variable (data array) |
| PTR | type cast operator |
The PTR operator is similar to type casting in other programming languages. In some cases, it is required to specify the size of an operand. Consider, for example, an indirect increment instruction: the assembler can't determine the size of the memory operand pointed to by the RBX register, which is why we have to specify the operand size with the PTR operator.
    inc [RBX]                ; Error! - argument size is not specified
    inc BYTE PTR [RBX]       ; Increment one byte addressed with RBX
    inc WORD PTR [RBX]       ; Increment word addressed with RBX
    mov [RSI], AX            ; Store word from AX to memory - AX use determines the size
    mov [RSI], 5             ; Error! - constant operand does not determine the size
    mov BYTE PTR [RSI], 5    ; Store 8-bit value
    mov WORD PTR [RSI], 5    ; Store 16-bit value
    mov DWORD PTR [RSI], 5   ; Store 32-bit value
An important operator used in data definitions is DUP. It specifies the number of repetitions of the initial value. We'll present details of it later in this chapter.
Programs in modern 64-bit operating systems are divided into code and data sections. The operating system maintains the stack, and currently, no stack section is defined in user programs. To start the code section, the .CODE directive is used. The code section contains all instructions in a program. To identify the beginning of the data section, the .DATA directive is used. The data section contains all the variables used in a program.
The location counter is an internal variable, maintained by the assembler, used to assign addresses to program items. During assembly, it plays a role similar to that of the instruction pointer during program execution. It contains the address of the currently processed variable in a data section or of the current instruction in a code section. Any directive which starts a section defines a new location counter and sets it to 0. If the same section is continued in another place in a program, the location counter resumes from the value it reached earlier, so it increments continuously throughout the whole section. Each assembled byte increases the location counter by 1. When the SEGMENT and ENDS directives are used, the first SEGMENT directive for a given section (segment) creates the location counter for that section. The ENDS directive suspends byte counting in that location counter until the next fragment of the section with the same name starts with another SEGMENT directive. The current value of the location counter can be retrieved with the $ sign.
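As a small illustration of the $ sign, the following sketch computes the length of a string at assembly time. The names message and message_len are hypothetical and were chosen only for this example.

```asm
.DATA
message     DB "Hello, MASM!"   ; 12 bytes stored at consecutive addresses
message_len = $ - message       ; location counter after the string minus its start = 12
```

Because the length is derived from the location counter, it follows the string automatically when the text is edited.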
The ORG directive sets the location counter to the value given as its argument. It is used to place parts of the program at specific addresses.
The EVEN directive aligns the next variable or instruction on an even address. Because some data elements in modern processors require or benefit from alignment to larger boundaries, for example 16 bytes for SSE vector operands, the ALIGN directive is often used instead of EVEN.
The ALIGN directive aligns the next variable or instruction on an address that is a multiple of the argument. The argument of ALIGN must be a power of two. The resulting gaps are filled with zeros in the data section or with appropriately-sized NOP instructions in the code section. Note that ALIGN 2 is equivalent to EVEN.
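A minimal sketch of ALIGN in a data section could look as follows. The variable names flag and counter are hypothetical.

```asm
.DATA
flag    DB 1        ; occupies one byte, so the next free address may not be a multiple of 4
ALIGN 4             ; pad with zeros until the location counter is a multiple of 4
counter DD 0        ; doubleword placed on a 4-byte boundary
```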
The LABEL directive creates a new label by assigning the current location-counter value and the given type to the defined name. Usually in a program, the : (colon) sign is used for label definition, but the LABEL directive enables specifying the type of element which the label points to.
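The following sketch shows one possible use of LABEL: giving the same memory location two names with different types. The names value_w and value_b are hypothetical.

```asm
.DATA
value_w LABEL WORD      ; same address as value_b, but typed as a word
value_b DB 34h, 12h     ; two bytes, low byte first

.CODE
    mov al, value_b     ; loads the first byte (34h)
    mov ax, value_w     ; loads both bytes as one word (1234h)
```

Without the second label, the word access would need an explicit WORD PTR override on value_b.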
There is a set of directives for defining variables. They enable the assignment of a name to a variable and the specification of its type. They are summarised in the table below.
| Name | data type | data size | comment |
|---|---|---|---|
| DB | byte | 1 byte | |
| BYTE | byte | 1 byte | |
| SBYTE | signed byte | 1 byte | |
| DW | word | 2 bytes | |
| WORD | word | 2 bytes | |
| SWORD | signed word | 2 bytes | |
| DD | doubleword | 4 bytes | |
| DWORD | doubleword | 4 bytes | |
| SDWORD | signed doubleword | 4 bytes | |
| DF | farword | 6 bytes | used as a pointer in 32-bit mode |
| FWORD | farword | 6 bytes | used as a pointer in 32-bit mode |
| DQ | quadword | 8 bytes | |
| QWORD | quadword | 8 bytes | |
| SQWORD | signed quadword | 8 bytes | |
| DT | 10 bytes | 10 bytes | used as 80-bit BCD integer for FPU |
| TBYTE | 10 bytes | 10 bytes | used as 80-bit BCD integer for FPU |
| OWORD | octalword | 16 bytes | |
| REAL4 | single precision | 4 bytes | floating point for FPU |
| REAL8 | double precision | 8 bytes | floating point for FPU |
| REAL10 | extended double precision | 10 bytes | floating point for FPU |
Variable definition directives can be used to define single variables, data tables or strings, depending on the list of operands. The ? sign can be used as an operand to signal that the initial value remains undefined.
    var_x    DB 10                      ; single byte variable with initial value 10
    var_y    DW 20                      ; single word variable with initial value 20
    var_z    DD ?                       ; single uninitialised doubleword
    table_a  DQ 1, 2, 3, 4, 5           ; table of five quadwords
    string_b BYTE "I like assembler"    ; string with ASCII codes of all characters
The previously mentioned DUP operator and type operators can be illustrated with some example data definitions.
    ;                                    TYPE  LENGTHOF  SIZEOF
    A   DB 10 DUP (?)                  ;  1     10        10
    AB  DW 10 DUP (?)                  ;  2     10        20
    ABC DD 10 DUP (?)                  ;  4     10        40
    AD  DB 5 DUP (5 DUP (5 DUP(?)))    ;  1     125       125
An example which shows the DUP and SIZEOF operators together with data definitions is in the following code. This code defines the uninitialised 256-byte data buffer and fills it with zeros. Please note that in the mov [RBX], 0 instruction, BYTE PTR must be used, because neither [RBX] nor 0 determines the operand size.
    .DATA
    buffer DB 256 DUP (?)
    .CODE
        lea  RBX, buffer
        mov  RCX, SIZEOF buffer
    clear:
        mov  BYTE PTR [RBX], 0
        inc  RBX
        loop clear
Constants in an assembler program give a name to a value that can't be changed during normal program execution. The value is assigned to the name at assembly time. Although the name suggests that the value can't be altered, this is true only at program run-time: some forms of constants can be modified during assembly. Usually, constants are used to self-document the code, parameterise the assembly process, and perform assembly-time calculations.
The constants can be integer, floating-point numeric, or text strings.
Integer numeric constants can be defined with the data assignment directives, EQU or the equal sign =. The difference is that a numeric constant defined with the EQU directive can’t be modified later in the program, while a constant created with the equal sign can be redefined many times in the program. Numeric constants can be expressed as binary, octal, decimal or hexadecimal values. They can also be a result of an expression calculated during assembly time. It is possible to use a previously defined constant in such an expression.
    int_const1       EQU 5              ; no suffix: decimal value by default
    int_const_dec    = 7                ; finished with "d", "D", "t", "T", or by default without suffix
    int_const_binary = 100100101b       ; finished with "b", "B", "y", or "Y"
    int_const_octal  = 372o             ; finished with "o", "O", "q", or "Q"
    int_const_hex    = 0FFA4h           ; finished with "h", or "H"
    int_const_expr   = int_const_dec * 5
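The difference between EQU and the equal sign can be sketched as follows. The names limit and index are hypothetical.

```asm
limit   EQU 10              ; fixed: a later "limit EQU 20" would be reported as an error
index   = 0                 ; redefinable assembly-time value
index   = index + 1         ; index is now 1
index   = index + limit     ; index is now 11
```

Redefinable constants of this kind are typically used as counters in assembly-time calculations.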
Floating-point numeric constants can be defined with the EQU directive only. The number can be expressed in decimal or scientific notation.
    real_const1 EQU 3.1415  ; decimal
    real_const2 EQU 6.28e2  ; scientific
Text string constants can be defined with the EQU or TEXTEQU directive. Text constants assigned with either directive can be redefined later in the program. TEXTEQU is considered a text macro and is described in the section about macros.
    text_const1 EQU 'Hello World!'
    text_const2 EQU "Hello World!"
The conditional assembly directives have the same functionality as in high-level language compilers. They control the assembly process by checking the defined conditions and enabling or disabling assembly for fragments of the source code.
The conditional assembly statement, together with optional ELSEIF and ELSE statements, is shown in the following code.
    IF expression1
        statements
    [[ELSEIF expression2
        statements]]
    [[ELSE
        statements]]
    ENDIF
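As a hedged illustration of this syntax, the sketch below selects one of two data definitions at assembly time. The DEBUG switch and the log_buffer variable are hypothetical names introduced only for this example.

```asm
DEBUG = 1                       ; hypothetical assembly-time switch

.DATA
IF DEBUG EQ 1
log_buffer DB 64 DUP (?)        ; assembled only when DEBUG equals 1
ELSE
log_buffer DB 1 DUP (?)         ; assembled for any other value of DEBUG
ENDIF
```

Only one of the two definitions reaches the object file, so the name log_buffer is defined exactly once.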
Statements in assembler programs written in MASM are the lines of code that make up the source files. Each MASM statement specifies an instruction for the processor or a directive for the assembler. Statements have up to four fields: name, operation, operands and comment.
All fields in a statement are optional. A statement can be composed of a label only (ended with a colon), an operation only (if it doesn't require operands), or a comment only. A few examples of proper statements are presented in the following code.
    ; name  ; operation ; operands ; comment
    cns_y     EQU         134        ; definition of a constant named cns_y with the value 134
              .DATA                  ; operation only - directive to start data section
    var_x     DB          123        ; definition of a variable named var_x with init value 123
              .CODE                  ; operation only - directive to start code section
    begin:                           ; name only - label that represents an address
              mov         rax, rbx   ; operation and corresponding operands
                                     ; comment only statement
              END                    ; operation only - end of the source file