Linux Assembly Part 2: Declaring Data - Cookie Engineer's Web Log

Previous article in this series : Linux Assembly Part 1 about Syscalls

This is the second article in the Linux Assembly series. This time, we will focus on how to represent different data types in nasm so that we can do something with them.

Registers

Remember the registers we used last time?

Assembly syntaxes sometimes feels a little special because of the way some registers are reserved for special purposes, so it's important to understand those registers and how they are used together with operators and functions.

There are various registers available for different purposes. See the table below to find out how they are named, and whether or not they're persistent if you make a call .

Description	64 bit	32 bit	16 bit	8 bit	Persistent?
Accumulator	RAX	EAX	AX	AL	No
Base	RBX	EBX	BX	BL	Yes
Counter / 4th Argument	RCX	ECX	CX	CL	No
Data / 3rd Argument	RDX	EDX	DX	DL	No
Stack Pointer	RSP	ESP	SP	SPL	Yes
Base Pointer / Frame Pointer	RBP	EBP	BP	BPL	Yes
1st Argument	RDI	EDI	DI	DIL	No
2nd Argument	RSI	ESI	SI	SIL	No
5th Argument	R8	R8D	R8W	R8B	No
6th Argument	R9	R9D	R9W	R9B	No
Temporary	R10 - R11	R10D - R11D	R10W - R11W	R10B - R11B	No
Callee-Saved Registers	R12 - R15	R12D - R15D	R12W - R15W	R12B - R15W	Yes

Data Types

Remember the .data section in which we declared our Hello, World! message from the previous article?

section .data
	msg db "Hello, World!"

section .text
	global _start

_start:
	; (...)

In the nasm language you can also declare other data types, which we are going to learn about now. Behind the scenes, everything is a byte or a word due to how x86 as an instruction set was designed at the time; but the NASM language abstracts away somewhat higher-level data types and how you can use them in a more typesafe manner.

Bytes and Words

The basic data types in nasm are :

byte is a byte that is 8 bits long.
word is 2 bytes long.
doubleword is 4 bytes long.
quadword is 8 bytes long.
doublequadword is 16 bytes long.

Unsigned Integers

Unsigned integers or signed int are binary numbers that can be represented as a byte , a word , a doubleword , a quadword , or doublequadword . The byte length influences the range of numbers we can represent.

unsigned integer as byte can represent the numbers from 0 to 255 .
unsigned integer as word can represent the numbers from 0 to 65535 .
unsigned integer as doubleword can represent the numbers from 0 to 4294967295 .
unsigned integer as quadword can represent the numbers from 0 to 18446744073709552000 .
unsigned integer as doublequadword can represent the numbers from 0 to 340282366920938463463374607431768211456

Signed Integers

Signed integers or unsigned int are binary numbers that can be represented in the same way as unsigned int but they represent different number ranges.

The first bit is set to 1 for negative numbers and is set to 0 for positive numbers. This means that the number range doesn't start at 0 , and instead starts at -(bitlength/2) and ends at +(bitlength/2)-1 .

signed integer as byte can represent the numbers from -128 to +127 .
signed integer as word can represent the numbers from -32768 to +32767 .
signed integer as doubleword can represent the numbers from -2^31 to +2^31-1 .
signed integer as quadword can represent the numbers from -2^63 to +2^63-1 .
signed integer as doublequadword can represent the numbers from -2^127 to +2^127-1 .

Strings

Strings are represented as double word chunks behind the scenes, which makes them a little quirky to work with. That means strings that are larger than a double word or 4 bytes need to be concatenated together to be used by instructions like cmp due to the bit size limitations of registers.

In order to prevent doing that most of the time, Kernel developers decided to offer syscalls that use references (or pointers) to addresses that contain the strings for that very reason.

So strings are special case that's important to keep in mind. Usually, userspace libraries try to abstract away dealing with string lengths. A common convention in the C ABI world, for example, is that strings are NULL delimited. This means that they have a trailing 0x00 byte that marks the end of the series of bytes that contain a string value.

NASM Pseudo Instructions

The NASM language specifies so-called pseudo instructions. These instructions are not part of the x86 (or x86_64 ) instruction set, but allow us to declare data in a much easier manner.

Declaring Initialized Data

The current pseudo instructions to declare initialized data are :

DB to declare a byte (8 bit)
DW to declare a word (16 bit)
DD to declare a double word (32 bit)
DDQ to declare a double quad word (64 bit)
DO to declare a generic output file (64 bit)
DY and DZ to declare YMM and ZMM registers (See AVX512 )

The limitations of what kind of data you can declare are as follows :

DD can declare a float
DQ can declare a double-precision float
DT can declare a extended-precision float
DT does not accept numeric constants.
DDQ does not accept float constants as operands.
Any operand size larger than DD (double word) does not accept strings as operands.

However, the pseudo instructions are somewhat data type independent, which means that they can have a different effect depending on what data type you're using to declare the data.

Declaring Bytes and Words

As the x86 (and therefore x86_64 ) instruction set is little-endian , the above pseudo instructions also exist to do the conversion from/to endianness for us.

db 0x12               ; 0x12
db 0x11,0x12,0x13     ; 0x12 0x12 0x13
dd 0x11223344         ; 0x44 0x33 0x22 0x11                     (note the endianness)
dq 0x1122334455667788 ; 0x88 0x77 0x66 0x55 0x44 0x33 0x22 0x11 (note the endianness)

Declaring Floating-Point Numbers

The floating point number precision is a little quirky due to their byte length to represent the precision after the comma.

dd 1.234567e20 ; floating-point constant
dq 1.234567e20 ; double-precision float
dt 1.234567e20 ; extended-precision float

Declaring Strings

Both character constants and strings can be declared using single quotation marks around them. However, behind the scenes, string is almost always declared and processed as a double word in many instructions. If you declare a dw 'string' that doesn't fill out all the reserved bytes, a trailing 0x00 byte is added.

db 'A',0x42  ; 'AB' string in ASCII
dw 'A'       ; 0x41 0x00            (filled with trailing 0x00 byte)
dw 'AB'      ; 0x41 0x42
dw 'ABC'     ; 0x41 0x42 0x43 0x00  (filled with trailing 0x00 byte)

You can read more about pseudo instructions in Chapter 3 of the NASM Documentation .

Declaring Uninitialized Data

The current pseudo instructions to declare uninitialize data are :

RESB to reserve a byte (8 bit)
RESW to reserve a word (16 bit)
RESD to reserve a double word (32 bit)
RESO to reserve a generic output file (64 bit)
RESY and RESZ to declare YMM and ZMM registers (See AVX512 )
RESD to reserve a float
RESQ to reserve a double-precision float
REST to reserve an extended-precision float

resb 32     ; reserve 32 bytes
resw  2     ; reserve 2 words
resq 10     ; reserve 10 double-precision floats
resy  2     ; reserve 2 YMM registers
resz  4     ; reserve 4 ZMM registers