====== MASM basics ======

An assembler, understood as software that translates assembler source code into machine code, can be implemented in various ways. While the processor's instructions remain constant, other language elements may be implementation-specific. In this chapter, we present the most important language elements specific to the MASM assembler. For a detailed description of the MASM assembly language, please refer to the Microsoft website((https://learn.microsoft.com/en-us/cpp/assembler/masm/microsoft-macro-assembler-reference?view=msvc-170)).

===== Alphabet =====

An alphabet is the set of characters which can be used in writing programs. In MASM, it includes lowercase and uppercase letters, digits, hidden characters and special characters.
  * Letters: A…Z, a…z
  * Digits: 0…9
  * Hidden ASCII characters: 09h, 20h, 0Dh, 0Ah
  * Special characters: + - * / = ( ) [ ] < > . , ; ' " _ : ? @ $ & %

Special characters may have a defined meaning, and not all of them can be used freely in the program.
  * , – comma – separates the operands,
  * '…' – apostrophe – text delimiter,
  * "…" – quotation marks – text delimiter,
  * (…) – round brackets – determine the order of expression evaluation,
  * ; – semicolon – begins the comment,
  * : – colon – delimits the labels and segment prefixes,
  * . – dot – used to access fields of structured data types, begins some of the directives,
  * & – ampersand – used in macros to replace a formal argument with an actual value,
  * % – percent – expansion operator used in macros,
  * <…> – angle brackets – text delimiter used in macros,
  * […] – square brackets – used in address expressions,
  * $ – dollar sign – current value of the location counter,
  * = – equal sign – directive to define constants,
  * ? – question mark – undefined (uninitialised) value,
  * @ – at sign – begins predefined names,
  * _ – underscore – often used in symbolic names instead of the space,
  * + - * / – mathematical operators.

Hidden ASCII characters:
  * 0Dh, 0Ah – CR LF (Enter) – end of line,
  * 20h – space – separates items in a line,
  * 09h – tab – used instead of spaces to improve readability of the code.

===== Keywords and symbolic names =====

Reserved words, also known as keywords, are words that have a special meaning in MASM. They represent elements of the language defined by the MASM creators, and their use is reserved for special purposes only. They can't be used as label names, variable names, constant names and similar user-defined items. They are:
  * Instructions,
  * Directives,
  * Attributes,
  * Operators,
  * Predefined symbols.

Symbolic names are words defined by the programmer for identifying elements of the program. They are used to name constants, variables, addresses, segments and other items of the source code. Certain rules must be followed when creating a symbolic name. A symbolic name can't begin with a digit and can consist of letters, digits and four special characters: $, @, _, and ?. By default, upper-case and lower-case letters are treated as the same. Older versions of MASM recognised only the first 31 characters of a name; current versions accept names of up to 247 characters, all of them significant.

Examples of proper symbolic names:
<code>
ABC123_4
Number_602602602_@1
?quest
$125
__Right_Here
</code>

Examples of improper symbolic names:
<code>
12_cats
?
'name'
Hello.world
Right Here
</code>

===== Operators =====

Operators are used in expressions calculated during assembly. They enable arithmetic and logic calculations on numeric expressions and address expressions. Some operators are associated with macros or other specific elements of the assembler language. We'll present some of them. For details about all operators, please refer to the MASM documentation((https://learn.microsoft.com/en-us/cpp/assembler/masm/operators-reference?view=msvc-170)).
The operators which can be used in numeric expressions are:
  * ** * **, ** / ** - multiplication and division,
  * **MOD** - remainder of an integer division,
  * **SHL**, **SHR** - shift left and right,
  * **OR**, **XOR**, **AND**, **NOT** - bitwise logical operations.

The operators which can be used in numeric and address expressions are:
  * ** + **, ** - ** - addition, subtraction,
  * **HIGH**, **LOW** - high/low 8 bits of a 16-bit value/address,
  * **HIGHWORD**, **LOWWORD** - high/low 16 bits of a 32-bit value/address,
  * **HIGH32**, **LOW32** - high/low 32 bits of a 64-bit value/address.

The type operators are used in address expressions. They determine the number of bytes in a single variable, the number of elements in a data array, or the size of a whole data array {{ref>masmtypeoperators}}. They are very useful because they let the assembler automatically recalculate, for example, the number of iterations necessary to process a string of characters or a data array when its length changes.

^ operator ^ value ^
| **TYPE** | type (number of bytes in one variable) |
| **LENGTH** | number of elements in a one-dimensional array |
| **SIZE** | number of bytes in a one-dimensional array |
| **LENGTHOF** | number of elements in a multi-dimensional array |
| **SIZEOF** | number of bytes in a multi-dimensional array |
| **PTR** | type cast operator |
MASM type operators
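
As a minimal sketch, the numeric and type operators listed above could be used as follows. All names (BLOCK_SIZE, table_w and so on) are hypothetical and chosen only for illustration; the values in the comments follow from the operator descriptions.

<code>
; numeric expressions evaluated at assembly time
BLOCK_SIZE  = 1 SHL 12              ; 4096
BLOCK_MASK  = BLOCK_SIZE - 1        ; 0FFFh
REMAINDER   = 1000 MOD 7            ; 6
LOW_BYTE    = LOW 1234h             ; 34h
HIGH_BYTE   = HIGH 1234h            ; 12h

.DATA
table_w     DW 1, 2, 3, 4           ; hypothetical array of four words

.CODE
        mov eax, TYPE table_w       ; 2 - bytes in one element
        mov ecx, LENGTHOF table_w   ; 4 - number of elements
        mov edx, SIZEOF table_w     ; 8 - bytes in the whole array
</code>

If another element is appended to table_w, the last two values change on the next assembly without any edit to the instructions.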
The following dependencies occur:\\
SIZE = TYPE * LENGTH\\
SIZEOF = TYPE * LENGTHOF

The **PTR** operator is similar to type casting in other programming languages. In some cases, it is required to specify the size of an operand. For example, in an indirect increment instruction the assembler can't determine the size of the operand in memory pointed to by the RBX register, which is why we have to specify the operand size with the **PTR** operator.

<code>
inc [RBX]                ; Error! - operand size is not specified
inc BYTE PTR [RBX]       ; Increment one byte addressed with RBX
inc WORD PTR [RBX]       ; Increment word addressed with RBX

mov [RSI], AX            ; Store word from AX to memory - AX determines the size
mov [RSI], 5             ; Error! - constant operand does not determine the size
mov BYTE PTR [RSI], 5    ; Store 8-bit value
mov WORD PTR [RSI], 5    ; Store 16-bit value
mov DWORD PTR [RSI], 5   ; Store 32-bit value
</code>

An important operator used in data definitions is **DUP**. It specifies the number of repetitions of the initial value. We'll present details of it later in this chapter.

===== Code and data sections =====

Programs in modern 64-bit operating systems are divided into code and data sections. The operating system maintains the stack, so user programs no longer define a stack section. To start the code section, the **.CODE** directive is used. The code section contains all instructions in a program. To identify the beginning of the data section, the **.DATA** directive is used. The data section contains all the variables used in a program.

On processors up to 32-bit, the functional fragments of programs were referred to as segments, because they were assigned to segment registers in the processor. In 64-bit mode, the segmentation mechanism is no longer used, so the code and data fragments of programs are named sections. However, the name segment can still be found frequently in the literature and on websites.
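
The layout described above can be sketched as a minimal source file. This is an illustration only, assuming the 64-bit assembler (ml64); the names counter and main are hypothetical, and the **PROC**, **ENDP** and **END** directives are used here without further discussion.

<code>
.DATA                        ; start of the data section
counter DQ 0                 ; hypothetical 64-bit variable

.CODE                        ; start of the code section
main    PROC                 ; hypothetical procedure name
        mov  rax, counter    ; load the variable from the data section
        inc  rax
        mov  counter, rax    ; store the result back
        ret
main    ENDP
        END                  ; end of the source file
</code>

Note that no stack section appears in the file; the stack is provided by the operating system.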
===== Location counter =====

The location counter is an internal variable, maintained by the assembler, used to assign addresses to program items. During assembly, it plays a similar role to that of the instruction pointer during program execution. The location counter holds the address of the currently processed variable in a data section or of the currently processed instruction in a code section. Any directive which starts a section defines a new location counter and sets it to 0. If the same section is continued in another place in a program, the location counter increments continuously throughout the whole section. Each assembled byte increases the location counter by 1.

When the **SEGMENT** and **ENDS** directives are used, the **SEGMENT** directive, used for a specific section (segment) for the first time, creates the location counter for this section. The **ENDS** directive suspends byte counting in a given location counter until the next fragment of the section with the same name starts with another **SEGMENT** directive. The current value of the location counter can be retrieved with the **$** sign.

===== Selected directives =====

The **ORG** directive sets the location counter to the value specified as its argument. It is used to place parts of the program at a specific address.

The **EVEN** directive aligns the next variable or instruction on an even address. As some data elements in modern processors require alignment to larger boundaries, for example addresses divisible by 16, the **ALIGN** directive is often used instead of **EVEN**.

The **ALIGN** directive aligns the next variable or instruction on the next address that is a multiple of the argument. The argument of **ALIGN** must be a power of two. Empty spaces are filled with zeros in the data section or with appropriately-sized **NOP** instructions in the code section. Note that **ALIGN 2** is equivalent to **EVEN**.

The **LABEL** directive creates a new label by assigning the current location-counter value and the given type to the defined name.
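
The location counter and the directives described above can be illustrated with a short data-section fragment. It is only a sketch: all names (message, msg_len, value_d, bytes_v, words_v) are hypothetical.

<code>
.DATA
message  DB "I like assembler"   ; 16 characters
msg_len  = $ - message           ; location counter minus start of message = 16

         ALIGN 4                 ; next item starts at an address divisible by 4
value_d  DD 0

bytes_v  LABEL BYTE              ; typed label: same address as words_v, BYTE type
words_v  DW 1234h, 5678h
</code>

Because bytes_v has the BYTE type, an instruction such as mov al, bytes_v can read a single byte of the word array without the **PTR** operator.
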
Usually in a program, the **:** (colon) sign is used for label definition, but the **LABEL** directive also enables specifying the type of the element which the label points to.

===== Data definition directives =====

There is a set of directives for defining variables. They enable the assignment of a name to a variable and the specification of its type. They are summarised in the table {{ref>masmdatadefine}}.

^ Name ^ data type ^ data size ^ comment ^
| **DB** | byte | 1 byte | |
| **BYTE** | byte | 1 byte | |
| **SBYTE** | signed byte | 1 byte | |
| **DW** | word | 2 bytes | |
| **WORD** | word | 2 bytes | |
| **SWORD** | signed word | 2 bytes | |
| **DD** | doubleword | 4 bytes | |
| **DWORD** | doubleword | 4 bytes | |
| **SDWORD** | signed doubleword | 4 bytes | |
| **DF** | farword | 6 bytes | used as a pointer in 32-bit mode |
| **FWORD** | farword | 6 bytes | used as a pointer in 32-bit mode |
| **DQ** | quadword | 8 bytes | |
| **QWORD** | quadword | 8 bytes | |
| **SQWORD** | signed quadword | 8 bytes | |
| **DT** | 10 bytes | 10 bytes | used as 80-bit BCD integer for FPU |
| **TBYTE** | 10 bytes | 10 bytes | used as 80-bit BCD integer for FPU |
| **OWORD** | octalword | 16 bytes | |
| **REAL4** | single precision | 4 bytes | floating point for FPU |
| **REAL8** | double precision | 8 bytes | floating point for FPU |
| **REAL10** | extended double precision | 10 bytes | floating point for FPU |
MASM variable definition directives
Variable definition directives can be used to define single variables, data tables or strings. Which of these is defined depends on the list of operands. It is allowed to use **?** as an operand, signalling that the initialisation value remains undefined.

<code>
var_x    DB   10                  ; single byte variable with initial value 10
var_y    DW   20                  ; single word variable with initial value 20
var_z    DD   ?                   ; single uninitialised doubleword
table_a  DQ   1, 2, 3, 4, 5       ; table of five quadwords
string_b BYTE "I like assembler"  ; string with ASCII codes of all characters
</code>

The previously mentioned **DUP** operator and the type operators can be explained with some exemplary data definitions.

<code>
                                  ; TYPE  LENGTHOF  SIZEOF
A    DB 10 DUP (?)                ; 1     10        10
AB   DW 10 DUP (?)                ; 2     10        20
ABC  DD 10 DUP (?)                ; 4     10        40
AD   DB 5 DUP (5 DUP (5 DUP(?)))  ; 1     125       125
</code>

The following code shows the **DUP** and **SIZEOF** operators together with data definitions. It defines an uninitialised 256-byte data buffer and fills it with zeros. Please note that in the **mov [RBX], 0** instruction, **BYTE PTR** must be used, because neither [RBX] nor 0 determines the operand size.

<code>
.DATA
buffer  DB 256 DUP (?)            ; uninitialised 256-byte buffer

.CODE
        lea  RBX, buffer          ; RBX = address of the buffer
        mov  RCX, SIZEOF buffer   ; RCX = number of bytes to clear
clear:  mov  BYTE PTR [RBX], 0    ; store zero in the current byte
        inc  RBX                  ; move to the next byte
        loop clear                ; decrement RCX, repeat until it reaches 0
</code>

===== Constants =====

Constants in an assembler program define a name for a value that can't be changed during normal program execution. The value is assigned to its name at assembly time. Although the name suggests that the value can't be altered, this holds only at program run time; some forms of constants can be modified during assembly. Usually, constants are used to self-document the code, parameterise the assembly process, and perform assembly-time calculations. Constants can be integer numbers, floating-point numbers, or text strings.\\
Integer numeric constants can be defined with the data assignment directives, **EQU** or the equal sign **=**.
The difference is that a numeric constant defined with the **EQU** directive can't be modified later in the program, while a constant created with the equal sign can be redefined many times in the program. Numeric constants can be expressed as binary, octal, decimal or hexadecimal values. They can also be the result of an expression calculated during assembly. It is possible to use a previously defined constant in such an expression.

<code>
int_const1       EQU 5            ; no suffix - decimal value by default
int_const_dec    = 7              ; finished with "d", "D", "t", "T", or by default without suffix
int_const_binary = 100100101b     ; finished with "b", "B", "y", or "Y"
int_const_octal  = 372o           ; finished with "o", "O", "q", or "Q"
int_const_hex    = 0FFA4h         ; finished with "h", or "H"
int_const_expr   = int_const_dec * 5
</code>

Floating-point numeric constants can be defined with the **EQU** directive only. The number can be expressed in decimal or scientific notation.

<code>
real_const1 EQU 3.1415            ; decimal
real_const2 EQU 6.28e2            ; scientific
</code>

Text string constants can be defined with the **EQU** or **TEXTEQU** directives. Text constants assigned with the **EQU** or **TEXTEQU** directive can be redefined later in the program. The **TEXTEQU** directive is considered a text macro and is described in the section about macros.

<code>
text_const1 EQU 'Hello World!'
text_const2 EQU "Hello World!"
</code>

===== Conditional assembly directives =====

The conditional assembly directives have the same functionality as their counterparts in high-level language compilers. They control the assembly process by checking the defined conditions and enabling or disabling assembly of fragments of the source code.
  * **IF** expression - assembles the following statements if the expression is true (non-zero),
  * **IFE** expression - assembles the following statements if the expression is false (0),
  * **IFDEF** symbol - tests whether a symbol is defined,
  * **IFNDEF** symbol - tests whether a symbol is undefined,
  * **IFB** - tests whether the string argument is empty,
  * **IFNB** - tests whether the string argument is not empty.
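
A minimal sketch of how the directives listed above might be used to parameterise a build is given below. The names DEBUG_BUILD, BUF_SIZE, MAX_ITEMS, buffer and items are hypothetical; the **ELSE** and **ENDIF** directives close the conditional blocks.

<code>
DEBUG_BUILD = 1                  ; hypothetical assembly-time switch

IF DEBUG_BUILD
BUF_SIZE = 64                    ; assembled when DEBUG_BUILD is non-zero
ELSE
BUF_SIZE = 4096                  ; assembled when DEBUG_BUILD is 0
ENDIF

IFNDEF MAX_ITEMS                 ; provide a default only if not defined earlier
MAX_ITEMS = 10
ENDIF

.DATA
buffer  DB BUF_SIZE DUP (?)      ; 64 bytes with the settings above
items   DW MAX_ITEMS DUP (0)     ; 10 words with the settings above
</code>

Only the selected branch reaches the object file; the other branch is skipped entirely during assembly.
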
The conditional assembly statement, together with optional **ELSEIF** and **ELSE** statements, is shown in the following code. The parts in double square brackets are optional.

<code>
IF expression1
    statements
[[ELSEIF expression2
    statements]]
[[ELSE
    statements]]
ENDIF
</code>

===== Statements =====

Statements in assembler programs written in MASM are the lines of code composing the source files. Each MASM statement specifies an instruction for the processor or a directive for the assembler. Statements have up to four fields.
  * name - Specifies the name of the program line. This can serve as a label for the instruction, allowing other instructions to refer to it by name. Some directives also require a name, specifying a variable, type, constant, segment, macro, procedure or other element of the source file.
  * operation - This is the main element which defines the action of the statement. This can be an instruction for the processor or an assembler directive.
  * operands - This field depends on the operation. Some operations do not accept operands, some require a list of one or more operands. Operands are also referred to as arguments.
  * comment - This field is for documentation purposes and is ignored by the assembler. Good comments make it easier to understand and maintain the program.

All fields in a statement are optional. A statement can be composed of a label only (ended with a colon), an operation only (if it doesn't require operands), or a comment only. A few examples of proper statements are presented in the following code.

<code>
; name   operation  operands   comment
cns_y    EQU        134        ; definition of a constant named cns_y with the value 134
         .DATA                 ; operation only - directive to start data section
var_x    DB         123        ; definition of a variable named var_x with init value 123
         .CODE                 ; operation only - directive to start code section
begin:                         ; name only - label that represents an address
         mov        rax, rbx   ; operation and corresponding operands
                               ; comment only statement
         END                   ; operation only - end of the source file
</code>