MASM basics

An assembler, understood as software that translates assembly source code into machine code, can be implemented in various ways. While the processor's instructions remain constant, other language elements may be implementation-specific. This chapter presents the most important language elements specific to the MASM implementation. For a detailed description of the MASM assembly language, please refer to the Microsoft website^[1].

Alphabet

An alphabet is the set of characters that can be used in writing programs. In MASM, it includes lower-case and upper-case letters, digits, hidden (whitespace and control) characters and special characters.

Letters: A…Z, a…z
Digits: 0…9
Hidden ASCII characters: 09h, 20h, 0Dh, 0Ah
Special characters: + - * / = ( ) [ ] < > . , ; ' " _ : ? @ $ & %

Special characters may have a defined meaning, and not all of them can be used freely in the program.

, – comma – separates operands,
'…' – apostrophes – text delimiter,
"…" – quotation marks – text delimiter,
(…) – round brackets – determine the order of expression evaluation,
; – semicolon – begins a comment,
: – colon – terminates labels and follows segment prefixes,
. – dot – used to access fields of structured data types, begins some directives,
& – ampersand – used in macros to replace a formal argument with an actual value,
% – percent – expansion operator used in macros,
<…> – angle brackets – text delimiter used in macros,
[…] – square brackets – used in address expressions,
$ – dollar sign – current value of the location counter,
= – equal sign – directive to define constants,
? – question mark – undefined (uninitialised) value,
@ – at sign – begins predefined names,
_ – underscore – often used in symbolic names instead of a space,
+ - * / – arithmetic operators.

Hidden ASCII characters:

0Dh, 0Ah – CR, LF (Enter) – end of line,
20h – space – separates items in a line,
09h – tab – used instead of spaces to improve readability of the code.

Keywords and symbolic names

Reserved words, known also as keywords, are words that have special meaning in MASM. They represent elements of the language defined by MASM creators, and their use is reserved for special purposes only. They can’t be used as label names, variable names, constant names and similar user-defined items. They are:

Instructions,
Directives,
Attributes,
Operators,
Predefined symbols.

Symbolic names are words defined by the programmer to identify elements of the program. They name constants, variables, addresses, segments and other items of the source code. Certain rules must be followed when creating a symbolic name. It can’t begin with a digit and can consist of letters, digits and four special characters: $, @, _, and ?. By default, upper-case and lower-case letters are treated as the same. Older MASM versions recognised only the first 31 characters of a name; current versions accept names of up to 247 characters. Examples of proper symbolic names:

ABC123_4
Number_602602602_@1
?quest
$125
__Right_Here

Examples of improper symbolic names.

12_cats
?
'name'
Hello.world
Right Here

Operators

Operators are used in expressions evaluated during assembly. They enable arithmetic and logic calculations on numeric expressions and address expressions. Some operators are associated with macros or other specific elements of the assembly language. Only some of them are presented here. For details about all operators, please refer to the MASM documentation^[2].

The operators which can be used in numeric expressions are

* , / - multiplication and division
MOD - remainder of an integer division
SHL, SHR - shift left and right
OR, XOR, AND, NOT - bitwise logical operations

The operators which can be used in numeric and address expressions are

+ , - - addition, subtraction
HIGH, LOW - high/low 8 bits of a 16-bit variable/address
HIGHWORD, LOWWORD - high/low 16 bits of 32-bit variable/address
HIGH32, LOW32 - high/low 32 bits of 64-bit variable/address
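
The operators listed above can be tried out in a few assembly-time constant definitions. This is only a sketch: all names are hypothetical, and the values in the comments follow from plain arithmetic on the constants shown.

```asm
; Assembly-time expressions; every name here is hypothetical
BUF_BYTES  = 4 * 64            ; 256
LAST_INDEX = BUF_BYTES - 1     ; 255
REMAINDER  = 17 MOD 5          ; 2
SHIFTED    = 1 SHL 4           ; 16
MASKED     = 0F5h AND 0Fh      ; 05h
ADDR16     = 1234h
HI_BYTE    = HIGH ADDR16       ; 12h
LO_BYTE    = LOW ADDR16        ; 34h
```

None of these lines generates machine code; the assembler evaluates each expression and only remembers the result under the given name.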

The type operators are used in address expressions. They determine the number of bytes in a single variable, the number of elements in a data array, or the size of a whole data array (Table 1). They are very useful because they let the assembler recalculate automatically, for example, the number of iterations needed to process a string of characters or a data array when its length changes.

Table 1: MASM type operators

operator	value
TYPE	type (number of bytes in one variable)
LENGTH	number of elements created by the first initialiser
SIZE	number of bytes allocated by the first initialiser
LENGTHOF	total number of elements in a data array
SIZEOF	total number of bytes in a data array
PTR	type cast operator

The following dependencies occur:
SIZE = TYPE * LENGTH
SIZEOF = TYPE * LENGTHOF
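
The difference between the two operator pairs shows up when an array is defined without DUP. The sketch below uses hypothetical names and assumes the documented behaviour that LENGTH and SIZE look only at the first initialiser, while LENGTHOF and SIZEOF cover the whole definition.

```asm
.DATA
list_w  DW 1, 2, 3, 4        ; four words, defined without DUP
table_w DW 8 DUP (0)         ; eight words, defined with DUP

; TYPE list_w      = 2
; LENGTHOF list_w  = 4    SIZEOF list_w  = 8
; LENGTH list_w    = 1    SIZE list_w    = 2   (first initialiser only)
; LENGTHOF table_w = 8    SIZEOF table_w = 16
; LENGTH table_w   = 8    SIZE table_w   = 16
```

For this reason LENGTHOF and SIZEOF are usually the safer choice when computing loop counts.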

The PTR operator is similar to type casting in other programming languages. In some cases, it is required to specify the size of an operand. Consider an increment instruction with an indirect memory operand: the assembler can't determine the size of the operand in memory pointed to by the RBX register, so the operand size must be specified with the PTR operator.

   inc [RBX]              ; Error! - operand size is not specified
   inc BYTE PTR [RBX]     ; Increment the byte addressed by RBX
   inc WORD PTR [RBX]     ; Increment the word addressed by RBX
 
   mov [RSI], AX          ; Store word from AX to memory - AX determines the size
   mov [RSI], 5           ; Error! - constant operand does not determine the size
 
   mov BYTE PTR [RSI], 5  ; Store 8-bit value
   mov WORD PTR [RSI], 5  ; Store 16-bit value
   mov DWORD PTR [RSI], 5 ; Store 32-bit value

An important operator used in data definitions is DUP. It specifies the number of repetitions of the initial value. We'll present details of it later in this chapter.

Code and data sections

Programs in modern 64-bit operating systems are divided into code and data sections. The operating system maintains the stack, and currently, no stack section is defined in user programs. To start the code section, the .CODE directive is used. The code section contains all instructions in a program. To identify the beginning of the data section, the .DATA directive is used. The data section contains all the variables used in a program.

In processors up to the 32-bit generation, the functional fragments of programs were referred to as segments, because they were assigned to segment registers in the processor. In 64-bit mode, the segmentation mechanism is largely disabled, so the code and data fragments of programs are called sections. However, the name segment can still be found frequently in the literature and on websites.
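
A minimal source layout with one data section and one code section might look as follows. This is a sketch that assumes the 64-bit assembler (ml64); the names counter and main are hypothetical, and the PROC and ENDP directives that delimit a procedure are not covered in this chapter.

```asm
; Minimal 64-bit source layout (assumes ml64; names are hypothetical)
.DATA
counter DQ 0                 ; variable placed in the data section

.CODE
main PROC                    ; start of a procedure in the code section
    mov  rax, counter        ; load the variable
    inc  rax                 ; modify it in a register
    mov  counter, rax        ; store it back
    ret
main ENDP
END
```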

Location counter

The location counter is an internal variable, maintained by the assembler, used to assign addresses to program items. During assembly, it plays a role similar to that of the instruction pointer during program execution. The location counter contains the address of the currently processed variable in a data section or of the currently processed instruction in a code section. Any directive which starts a new section creates a new location counter and sets it to 0. If the same section is continued in another place in a program, the location counter keeps counting across all fragments of that section. Each assembled byte increases the location counter by 1. When the SEGMENT and ENDS directives are used, the SEGMENT directive that opens a given section (segment) for the first time creates the location counter for this section. The ENDS directive suspends byte counting in that location counter until the next fragment of the section with the same name is opened with another SEGMENT directive. The current value of the location counter can be retrieved with the $ sign.
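
A common use of the $ sign is to let the assembler compute the length of a string. In the sketch below (hypothetical names), $ holds the address just past the last defined byte, so subtracting the address of the string yields its length.

```asm
.DATA
message     DB "Hello, MASM"       ; 11 bytes
message_len = $ - message          ; 11, computed at assembly time
```

If the text of message is edited later, message_len follows automatically, provided the constant is defined immediately after the string.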

Selected directives

The ORG directive sets the location counter to the value specified by its argument. It is used to place parts of the program at a specific address.

The EVEN directive aligns the next variable or instruction on an even address. As some data elements in modern processors require or benefit from alignment to larger boundaries, for example addresses divisible by 16, the ALIGN directive is often used instead of EVEN.

The ALIGN directive aligns the next variable or instruction on an address that is a multiple of the argument. The argument of ALIGN must be a power of two. Padding bytes are filled with zeros in the data section or with appropriately sized NOP instructions in the code section. Note that ALIGN 2 is equivalent to EVEN.

The LABEL directive creates a new label by assigning the current location-counter value and the given type to the defined name. Usually in a program, the : (colon) sign is used for label definition, but the LABEL directive enables specifying the type of element which the label points to.
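
The directives described above can be combined in a short data fragment. This is a sketch with hypothetical names; the offsets in the comments assume that the fragment starts at offset 0 of its section. A larger ALIGN argument than the one shown may be limited by the alignment of the section itself.

```asm
.DATA
flag     DB 1                 ; offset 0
         ALIGN 4              ; pads offsets 1..3
total    DD 0                 ; offset 4
tag      DB 2                 ; offset 8
         EVEN                 ; pads offset 9
count    DW 100               ; offset 10
as_bytes LABEL BYTE           ; byte-typed name for offset 12
value    DD 12345678h         ; offset 12
```

Because as_bytes has the BYTE type, an instruction such as mov al, as_bytes reads the lowest byte of value (78h, as x86 stores the least significant byte first) without needing the PTR operator.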

Data definition directives

There is a set of directives for defining variables. They assign a name to a variable and specify its type. They are summarised in Table 2.

Table 2: MASM variable definition directives

Name	data type	data size	comment
DB	byte	1 byte
BYTE	byte	1 byte
SBYTE	signed byte	1 byte
DW	word	2 bytes
WORD	word	2 bytes
SWORD	signed word	2 bytes
DD	doubleword	4 bytes
DWORD	doubleword	4 bytes
SDWORD	signed doubleword	4 bytes
DF	farword	6 bytes	used as a pointer in 32-bit mode
FWORD	farword	6 bytes	used as a pointer in 32-bit mode
DQ	quadword	8 bytes
QWORD	quadword	8 bytes
SQWORD	signed quadword	8 bytes
DT	ten bytes	10 bytes	used as 80-bit BCD integer for FPU
TBYTE	ten bytes	10 bytes	used as 80-bit BCD integer for FPU
OWORD	octaword	16 bytes
REAL4	single precision	4 bytes	floating point for FPU
REAL8	double precision	8 bytes	floating point for FPU
REAL10	extended double precision	10 bytes	floating point for FPU

Variable definition directives can be used to define single variables, data tables or strings, depending on the list of operands. A ? may be used as an operand to signal that the initial value remains undefined.

var_x    DB 10            ; single byte variable with initial value 10
var_y    DW 20            ; single word variable with initial value 20
var_z    DD ?             ; single uninitialised doubleword 
table_a  DQ 1, 2, 3, 4, 5 ; table of five quadwords
string_b BYTE "I like assembler" ; string with ASCII codes of all characters

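The previously mentioned DUP operator and type operators can be illustrated with some example data definitions. For nested DUP definitions, LENGTHOF and SIZEOF count all elements, whereas LENGTH and SIZE consider only the outermost DUP count.

                                  ; TYPE    LENGTHOF SIZEOF
A     DB 10 DUP (?)               ; 1       10       10
AB    DW 10 DUP (?)               ; 2       10       20
ABC   DD 10 DUP (?)               ; 4       10       40
AD    DB 5 DUP (5 DUP (5 DUP(?))) ; 1       125      125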

An example which shows the DUP and SIZEOF operators together with data definitions is in the following code. This code defines the uninitialised 256-byte data buffer and fills it with zeros. Please note that in the mov [RBX], 0 instruction, BYTE PTR must be used, because neither [RBX] nor 0 determines the operand size.

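.DATA
buffer DB 256 DUP (?)     ; uninitialised 256-byte buffer
.CODE
  lea RBX, buffer         ; RBX = address of the buffer
  mov RCX, SIZEOF buffer  ; RCX = number of bytes to clear (256)
clear:
  mov BYTE PTR [RBX], 0   ; store a zero byte
  inc RBX                 ; advance to the next byte
  loop clear              ; decrement RCX and repeat until it reaches 0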

Constants

Constants in an assembler program give a name to a value that can't be changed during normal program execution. The value is bound to its name at assembly time. Although the name suggests that the value can't be altered, this holds only at program run-time: some forms of constants can be modified during assembly. Usually, constants are used to self-document the code, parameterise the assembly process, and perform assembly-time calculations. Constants can be integer, floating-point, or text strings.

Integer numeric constants can be defined with the data assignment directives, EQU or the equal sign =. The difference is that a numeric constant defined with the EQU directive can’t be modified later in the program, while a constant created with the equal sign can be redefined many times in the program. Numeric constants can be expressed as binary, octal, decimal or hexadecimal values. They can also be the result of an expression calculated at assembly time. It is possible to use a previously defined constant in such an expression.

int_const1 EQU 5                ; no suffix: decimal by default
int_const_dec = 7               ; decimal: ends with "d", "D", "t", "T", or has no suffix
int_const_binary = 100100101b   ; binary: ends with "b", "B", "y", or "Y"
int_const_octal = 372o          ; octal: ends with "o", "O", "q", or "Q"
int_const_hex = 0FFA4h          ; hexadecimal: ends with "h" or "H", must begin with a digit
int_const_expr = int_const_dec * 5
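
The difference between EQU and the equal sign can be sketched as follows. The names are hypothetical, and the commented-out line shows a redefinition that the assembler is expected to reject.

```asm
counter_step = 1
counter_step = counter_step + 1     ; allowed: counter_step is now 2

max_items    EQU 10
; max_items  EQU 20                 ; error: a numeric EQU constant cannot be given a new value

total_bytes  = max_items * 1024     ; 10240, uses a previously defined constant
```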

Floating-point numeric constants can be defined with the EQU directive only. The number can be expressed in decimal or scientific notation.

real_const1 EQU 3.1415          ; decimal
real_const2 EQU 6.28e2          ; scientific

Text string constants can be defined with the EQU or TEXTEQU directives. Text constants assigned with EQU or TEXTEQU can be redefined later in the program. TEXTEQU is considered a text macro and is described in the section about macros.

text_const1 EQU 'Hello World!'
text_const2 EQU "Hello World!"

Conditional assembly directives

The conditional assembly directives have the same functionality as in high-level language compilers. They control the assembly process by checking the defined conditions and enabling or disabling assembly of fragments of the source code.

IF expression - assembles the following statements if the expression is true (non-zero),
IFE expression - assembles the following statements if the expression is false (zero),
IFDEF symbol - tests whether a symbol is defined,
IFNDEF symbol - tests whether a symbol is undefined,
IFB <argument> - tests whether the string argument is empty (blank),
IFNB <argument> - tests whether the string argument is not empty.

The conditional assembly statement, together with optional ELSEIF and ELSE statements, is shown in the following code. The double square brackets mark optional parts and are not typed in the source.

IF expression1
statements
[[ELSEIF expression2
statements ]]
[[ELSE
statements ]]
ENDIF
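
A concrete use of the general form might look like the sketch below. The names DEBUG_BUILD, BUFFER_SIZE and log_buf are hypothetical and serve only to illustrate the directives.

```asm
DEBUG_BUILD = 1                 ; hypothetical switch; set to 0 for a release build

IFNDEF BUFFER_SIZE              ; supply a default if none was defined earlier
BUFFER_SIZE = 256
ENDIF

.DATA
IF DEBUG_BUILD
log_buf DB BUFFER_SIZE DUP (0)  ; assembled only when DEBUG_BUILD is non-zero
ELSE
log_buf DB 1 DUP (0)            ; minimal placeholder otherwise
ENDIF
```

Only one of the two log_buf definitions reaches the object file, so the name is not defined twice.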

Statements

Statements in assembler programs written in MASM are the lines of code composing the source files. Each MASM statement specifies an instruction for the processor or a directive for the assembler. Statements have up to four fields.

name - Specifies the name of the program line. This can serve as a label for the instruction, allowing other instructions to refer to it by name. Some directives also require naming, specifying a variable, type, constant, segment, macro, procedure and other elements of the source file.
operation - This is the main element which defines the action of the statement. This can be an instruction for the processor or an assembler directive.
operands - This field depends on the operation. Some operations do not accept operands, some require a list of one or more operands. Operands are also referred to as arguments.
comment - This field is for documentation purposes and is ignored by the assembler. Good comments make it easier to understand and maintain the program.

All fields in a statement are optional. A statement can be composed of a label only (ended with a colon), an operation only (if it doesn't require operands), or a comment only. A few examples of proper statements are presented in the following code.

; name    ; operation ; operands ; comment
 
cns_y      EQU         134       ; definition of a constant named cns_y with the value 134
 
           .DATA                 ; operation only - directive to start data section
var_x      DB          123       ; definition of a variable named var_x with init value 123
 
           .CODE                 ; operation only - directive to start code section
begin:                           ; name only - label that represents an address
           mov         rax, rbx  ; operation and corresponding operands
                                 ; comment only statement
           END                   ; operation only - end of the source file

^[1] https://learn.microsoft.com/en-us/cpp/assembler/masm/microsoft-macro-assembler-reference?view=msvc-170

^[2] https://learn.microsoft.com/en-us/cpp/assembler/masm/operators-reference?view=msvc-170
