Understanding x86 Instructions

If you are going to be reverse engineering binaries it is vital that you understand what information the disassembler is giving you. Unfortunately for us, we can’t regenerate the source code for any given binary (though Ghidra and IDA get close!). We are instead given the assembly code for a binary and it is up to us to piece together what the binary is doing by reading what the disassembler shows us.

My favorite reference for x86 instructions, whether I’m unfamiliar with one or just need to brush up on its allowable operands, is Felix Cloutier’s x86 and amd64 Reference Manual. Much of the information below can be found in that reference manual.

An excellent tool to view assembly online is Matt Godbolt’s Compiler Explorer. Compiler Explorer is an interactive compiler. The left-hand pane shows editable C, C++, Rust, Go, D, Haskell, Swift and Pascal code. The right-hand pane shows the assembly output produced by compiling that code with a given compiler and settings. Compiler Explorer also supports MANY different compilers.

Terminology

With any new topic, the terminology can be quite confusing the first few times you come across it. To understand the basics here, I will cover the following terms:

Instruction
Destination and Source Operands
Immediate
Register
Memory Location
Intel/AT&T Syntax

Instruction

An x86 instruction is a statement that is executed at runtime. For example, mov, pop, push, or even cmpxchg8b are all instructions. You may hear these referred to as opcodes, but strictly speaking the name (mov, push) is the mnemonic; opcode typically refers to the hex value that encodes the instruction.

Destination Operand and Source Operands

An x86 instruction can have zero to three operands (or “arguments”). Operands are separated by commas. In AT&T syntax, for operations with two operands, the first (lefthand) operand is the source operand, and the second (righthand) operand is the destination operand. For Intel, the reverse is true.

Intel: mov dest, src -> mov eax, 1
AT&T: mov src, dest -> mov $1, %eax

Immediate

An immediate value (or simply an immediate or imm) is a piece of data that is stored as part of the instruction itself instead of being in a memory location or a register.

Immediate values are typically used in instructions that load a value or perform an arithmetic or a logical operation on a constant.

       ; push opcode    3
push 3 ;     6A        03

In the above example, the instruction push 3 gets written as 6A03. 3 is an immediate value as it is a part of the instruction itself.
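As a rough sketch of what “part of the instruction itself” means, the C snippet below pulls the immediate back out of the two encoded bytes. The function name decode_push_imm8 is hypothetical, and it only recognizes the single 6A encoding from the example above; it is not a general decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: recognizes only "push imm8" (opcode 0x6A followed by
 * one immediate byte). Returns 1 and stores the sign-extended immediate on
 * success, 0 for anything else. */
int decode_push_imm8(const uint8_t *code, size_t len, int32_t *imm)
{
    if (len < 2 || code[0] != 0x6A)
        return 0;
    *imm = (int8_t)code[1];   /* the immediate lives in the instruction bytes */
    return 1;
}
```

Feeding it the bytes 6A 03 yields the immediate 3 without reading any register or memory operand.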

If a value refers to memory or registers, it is not immediate.

Register

A register is a storage area inside the CPU. There are general purpose registers (rax, rbx, rcx, rdx), registers which have special usage (for example, the program counter registers), and various others (memory/segment registers, SSE).

The 64-bit versions of the x86 registers are named:

rax - register a extended
rbx - register b extended
rcx - register c extended
rdx - register d extended
rbp - register base pointer (base of the current stack frame)
rsp - register stack pointer (current location in stack, growing downwards)
rsi - register source index (source for data copies)
rdi - register destination index (destination for data copies)
r8-r15 - register 8-15

You can access these registers using the following conventions:

64-bit registers using the r prefix: rax, r15
32-bit registers using the e prefix (e_x) or d suffix (added registers: r__d): eax, r15d
16-bit registers using no prefix (_x) or a w suffix (added registers: r__w): ax, r15w
8-bit registers using the h (“high byte” of 16 bits) suffix (bits 8-15: _h): ah, bh
8-bit registers using the l (“low byte” of 16 bits) suffix (bits 0-7: _l) or the b suffix (added registers: r__b): al, bl, r15b

In summary, using the a register as the example:

Bit-length    64     32     16    8-high    8-low
Register      rax    eax    ax    ah        al
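One way to picture these sub-registers is as masks and shifts over a single 64-bit value. The C sketch below models that idea; the view_* helper names are made up for illustration, and it only models reading the smaller registers, not writing to them.

```c
#include <stdint.h>

/* Each helper returns the named sub-register view of a 64-bit rax value. */
uint32_t view_eax(uint64_t rax) { return (uint32_t)rax; }        /* bits 0-31 */
uint16_t view_ax(uint64_t rax)  { return (uint16_t)rax; }        /* bits 0-15 */
uint8_t  view_ah(uint64_t rax)  { return (uint8_t)(rax >> 8); }  /* bits 8-15 */
uint8_t  view_al(uint64_t rax)  { return (uint8_t)rax; }         /* bits 0-7  */
```

With rax holding 0x1122334455667788, eax is 0x55667788, ax is 0x7788, ah is 0x77, and al is 0x88.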

Memory Location

A memory location is as the name suggests the location of some area of memory.

For example, the function below grows the stack to make room for a 10-byte array, then accesses a memory location twice: once when writing to the array (“Fill array”) and once when moving the array’s value into the rdi register.

array_func:
	; Prologue
	push rbp
	mov rbp, rsp

	; Grow the stack
	sub rsp, 10

	; Fill array
	mov BYTE [rbp-0xa], 0

	; Our exit value (zero-extend the one byte we initialized)
	movzx rdi, BYTE [rbp-0xa]

	; Epilogue
	mov rsp, rbp
	pop rbp
	ret

In Intel syntax, you can access memory using brackets whereas in AT&T syntax it is enclosed in parentheses.
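For orientation, here is a C function that could plausibly correspond to array_func. This is an assumed counterpart written for illustration, not the source the listing was generated from, and a real compiler’s output will differ in its details. The volatile qualifier is there only to keep the compiler from optimizing the memory accesses away.

```c
/* Assumed C counterpart of array_func: a 10-byte local array lives on the
 * stack, so every access to it is a memory access relative to rbp. */
int array_func(void)
{
    volatile char array[10];  /* sub rsp, 10              */
    array[0] = 0;             /* mov BYTE [rbp-0xa], 0    */
    return array[0];          /* read back from [rbp-0xa] */
}
```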

Intel/AT&T Syntax

Intel and AT&T assembly syntax look very different from each other, which can be confusing when you first come across AT&T syntax after having learnt Intel syntax, or vice versa.

In AT&T syntax, registers are prefixed with a percent (%) sign and immediate values with a dollar ($) sign. In Intel syntax, hex numbers are commonly written with an h suffix and decimal numbers with an optional d suffix, while in AT&T syntax hex numbers are prefixed with 0x and decimal numbers have no prefix.

Intel Syntax           AT&T Syntax
mov     eax,1          movl    $1,%eax
mov     ebx,0ffh       movl    $0xff,%ebx
int     80h            int     $0x80

In AT&T syntax, the first operand is the source while the second operand is the destination. In Intel syntax, the first operand is the destination while the second operand is the source.

Intel Syntax           AT&T Syntax
instr  dest, src       instr  src, dest
mov    eax, 1          movl   $1, %eax

In Intel syntax, you can access memory using brackets whereas in AT&T syntax it is enclosed in parentheses. The structure of the instruction is also very different.

Intel Syntax           AT&T Syntax
mov BYTE [ebp-5], 1    movb $1, -5(%ebp)

To switch to Intel syntax in the following tools, use these options:

GDB: set disassembly-flavor intel
GCC: -masm=intel
Objdump: --disassembler-options=intel

Whether you prefer Intel or AT&T is up to you, but just be aware that Intel vs AT&T is a fairly niche Internet “holy war”. Those on the opposing side will vehemently disagree with you.

x86 Instructions

After a lot of boring (but necessary!) terminology, let’s jump into x86 instructions. Below you will find a few very common x86 instructions. Keep in mind that this is a very short list meant to introduce you to the kinds of operations you will likely see when disassembling a binary; it is in no way exhaustive. An exhaustive introduction to anything never does anyone any good.

add, sub, mul, and div

All of these basic math operations are fairly simple and you should already understand how they work, but for completeness and for reference I will leave them here.

add

add takes the form add dest, src. This equates to dest = dest + src.

The destination operand can be a register or a memory location; the source operand can be an immediate, a register, or a memory location. However, two memory operands cannot be used in one instruction.

When an immediate value is used as an operand, it is sign-extended to the length of the destination operand format.

sub

sub takes the form sub dest, src. This equates to dest = dest - src.

When an immediate value is used as an operand, it is sign-extended to the length of the destination operand format.
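The sign extension described for add and sub can be modelled in C. The sketch below assumes a 32-bit destination and an 8-bit immediate; the function names add_imm8 and sub_imm8 are hypothetical, and flags are not modelled.

```c
#include <stdint.h>

/* Models "add r32, imm8": the 8-bit immediate is sign-extended to 32 bits
 * before the addition, and the result wraps modulo 2^32. */
uint32_t add_imm8(uint32_t dest, uint8_t imm8)
{
    int32_t extended = (int8_t)imm8;      /* 0xFF becomes -1 (0xFFFFFFFF) */
    return dest + (uint32_t)extended;
}

/* Same idea for "sub r32, imm8". */
uint32_t sub_imm8(uint32_t dest, uint8_t imm8)
{
    int32_t extended = (int8_t)imm8;
    return dest - (uint32_t)extended;
}
```

So an immediate byte of 0xFF acts as -1 rather than 255: adding it to 10 gives 9, and subtracting it from 10 gives 11.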

mul

mul takes the form mul src. It has a single explicit operand; the destination is implied, so this equates to dest = dest * src.

The destination operand is an implied operand located in register AL, AX, EAX or RAX (depending on the size of the operand); the source operand is located in a general-purpose register or a memory location.

The result is stored in register AX, register pair DX:AX, register pair EDX:EAX, or register pair RDX:RAX (depending on the operand size), with the high-order bits of the product contained in register AH, DX, EDX, or RDX, respectively.

The action of this instruction and the location of the result depends on the opcode and the operand size.

Operand Size    Source 1    Source 2    Destination
Byte            AL          r/m8        AX
Word            AX          r/m16       DX:AX
Doubleword      EAX         r/m32       EDX:EAX
Quadword        RAX         r/m64       RDX:RAX
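To make the implied operands concrete, here is a small C model of the doubleword form of mul. The mul_regs struct and the mul32 function are hypothetical names for this sketch; flags are not modelled.

```c
#include <stdint.h>

/* Hypothetical register file holding just the two registers mul touches. */
struct mul_regs { uint32_t eax, edx; };

/* Models the 32-bit form "mul r/m32": EDX:EAX = EAX * src (unsigned). */
void mul32(struct mul_regs *r, uint32_t src)
{
    uint64_t product = (uint64_t)r->eax * src;
    r->eax = (uint32_t)product;          /* low-order 32 bits  */
    r->edx = (uint32_t)(product >> 32);  /* high-order 32 bits */
}
```

Multiplying EAX = 0xFFFFFFFF by 2 leaves 0xFFFFFFFE in EAX and 1 in EDX, which is why both registers have to be considered when reading the result.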

div

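div takes the form div src. The dividend is implied, so this equates to quotient = dividend / src and remainder = dividend % src.

div performs an unsigned division of the value in the AX, DX:AX, EDX:EAX, or RDX:RAX registers (dividend) by the source operand (divisor) and stores the result in the AX (AH:AL), DX:AX, EDX:EAX, or RDX:RAX registers. The source operand can be a general-purpose register or a memory location. If the divisor is zero or the quotient does not fit in its destination register, the CPU raises a divide error.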

Operand Size               Dividend    Divisor    Quotient    Remainder    Max Quotient
Word/Byte                  AX          r/m8       AL          AH           255
Doubleword/word            DX:AX       r/m16      AX          DX           65,535
Quadword/Doubleword        EDX:EAX     r/m32      EAX         EDX          2^32 - 1
Doublequadword/Quadword    RDX:RAX     r/m64      RAX         RDX          2^64 - 1
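The quadword/doubleword form of div can be sketched in C as follows. The div_regs struct and the div32 function are hypothetical names; returning 0 stands in for the divide error that a real CPU would raise.

```c
#include <stdint.h>

/* Hypothetical register file holding just the two registers div touches. */
struct div_regs { uint32_t eax, edx; };

/* Models "div r/m32": divides EDX:EAX by src; quotient -> EAX,
 * remainder -> EDX. Returns 0 where the real CPU would raise a divide
 * error: division by zero, or a quotient that does not fit in 32 bits. */
int div32(struct div_regs *r, uint32_t src)
{
    if (src == 0)
        return 0;
    uint64_t dividend = ((uint64_t)r->edx << 32) | r->eax;
    uint64_t quotient = dividend / src;
    if (quotient > UINT32_MAX)
        return 0;
    r->edx = (uint32_t)(dividend % src);
    r->eax = (uint32_t)quotient;
    return 1;
}
```

This also shows why EDX is usually cleared before an unsigned 32-bit division: whatever is left in EDX is treated as the upper half of the dividend.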

mov

The mov instruction is one of the most basic. It copies a value from the source operand to the destination operand. The source operand can be an immediate value, general-purpose register, segment register, or memory location; the destination operand can be a general-purpose register, segment register, or memory location. Both operands must be the same size, which can be a byte, a word, a doubleword, or a quadword.

For example, the following line moves the value 1 into the eax register: mov eax, 1.

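This can typically be done in fewer bytes using push and pop: mov eax, 1 encodes to 5 bytes, while the push/pop pair below takes 3. The code above in push/pop form would be as follows (this is 32-bit code; in 64-bit code the pop would be pop rax):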

push 1
pop eax

push and pop

push and pop are stack-related instructions. We will discuss the concept of push and pop operations on a stack, but we won’t discuss the call stack or stack pointer until stack-based buffer overflows. If you’re looking for a more in-depth analysis of a program’s stack I would recommend visiting this section.

push

+===+     +===+  +===+
| 2 |\    |   |  |   |
+===+ \   +===+  +===+
       -->|   |  | 2 |
          +---+  +---+
          | 1 |  | 1 |
          +===+  +===+

Push decrements the stack pointer and then stores the source operand on the stack. A push instruction looks like push eax or push 0x1. It only takes a single operand.

The first rectangle illustrates a stack containing {1} with a second value (2) in the process of being pushed on the stack. The second rectangle shows the resulting stack, after the push operation, containing {1, 2}.

pop

+===+  +===+      +===+
|   |  |   | /--->| 2 |
+===+  +===+/     +===+
| 2 |  |   |
+---+  +---+
| 1 |  | 1 |
+===+  +===+

Conversely, pop loads the value from the top of the stack to the location specified with the destination operand (or explicit opcode) and then increments the stack pointer.

The first rectangle illustrates a stack containing {1, 2}. The second rectangle shows the result of a stack containing {1, 2} after a pop operation was performed on it. The 2 was taken off of the stack and only {1} remains.
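The order of operations for push and pop can be sketched with a toy stack in C. Everything here (toy_stack, toy_push, toy_pop, the 16-slot size) is made up for illustration; the model works in whole slots rather than bytes and does no bounds checking.

```c
#include <stdint.h>

#define STACK_SLOTS 16

/* Toy stack that grows downwards, like the x86 stack. sp indexes the
 * current top; an empty stack has sp == STACK_SLOTS. No bounds checks. */
struct toy_stack {
    uint32_t slots[STACK_SLOTS];
    int sp;
};

void toy_init(struct toy_stack *s) { s->sp = STACK_SLOTS; }

/* push: decrement the stack pointer, then store the operand. */
void toy_push(struct toy_stack *s, uint32_t value)
{
    s->sp -= 1;
    s->slots[s->sp] = value;
}

/* pop: load the value at the top, then increment the stack pointer. */
uint32_t toy_pop(struct toy_stack *s)
{
    uint32_t value = s->slots[s->sp];
    s->sp += 1;
    return value;
}
```

Pushing 1 and then 2 and popping twice returns 2 first and then 1, matching the diagrams above.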

lea

The lea instruction often confuses beginner reverse engineers. It stands for Load Effective Address. It’s typically used, as the name suggests, to move an address into the destination operand. The source operand is a memory address (offset part) specified with one of the processor’s addressing modes; the destination operand is a general-purpose register.

+---------------+               +------------+
| Registers     |               | Memory     |
+---------------+               +------------+
| EAX = 0x000000|     0x4008e0 -> 0x7ffff7b4 |
| EBX = 0x4008e0|     0x4008e4 -> 0x00211000 |
+---------------+               +------------+

For example, the instruction lea eax, [ebx + 4] would result in EAX containing 0x4008e4. By contrast, a similar mov instruction mov eax, [ebx + 4] would result in EAX containing 0x00211000.

LEA may also be used for generic calculations: lea eax, [ eax + ebx + 1234567 ] calculates EAX + EBX + 1234567.
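The difference between lea and mov can be modelled in C using the register and memory values from the diagram above. The names MEM_BASE, memory, lea and mov_load are hypothetical, and the toy memory contains only the two words shown.

```c
#include <stdint.h>

/* Toy memory holding the two words from the diagram.
 * MEM_BASE is the address of the first word. */
#define MEM_BASE 0x4008e0u
static const uint32_t memory[2] = { 0x7ffff7b4u, 0x00211000u };

/* lea eax, [ebx + disp]: only computes the address, never touches memory. */
uint32_t lea(uint32_t ebx, uint32_t disp) { return ebx + disp; }

/* mov eax, [ebx + disp]: computes the same address, then loads from it.
 * Assumes the address falls inside the toy memory and is 4-byte aligned. */
uint32_t mov_load(uint32_t ebx, uint32_t disp)
{
    uint32_t address = ebx + disp;
    return memory[(address - MEM_BASE) / 4];
}
```

With EBX = 0x4008e0, lea with a displacement of 4 returns the address 0x4008e4, while the mov model returns the value stored there, 0x00211000.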

For more information on LEA, go to this very detailed, well-written StackOverflow post.

jmp

jmp is very easy to understand if you’ve ever written a basic program before. The compiler emits jump instructions wherever your program branches (an if/else or a loop, for example), although they can occasionally be optimized out. The jmp instruction changes what is called the control flow of a program. If you read through the Exploit Development Toolchain section, you might recognize that term. The control flow of a program describes which instructions get executed, and in what order, if at all.

There are multiple “forms” of jump instructions depending on the condition you want to jump for. There are quite a few (jmp, je, jne, jg, etc.) and they are listed here. jmp and friends take a single operand - the address to jump to.

Let’s take a look at the following C code:

int main(void){
	int x = 1;

	if (x)
		x = 0;
	else
		x = 1;

	return 0;
}

When translated into pseudo-assembly we should be expecting something similar to:

1) mov x, 1 ; int x = 1;
2) cmp x, 0 ; compare to see if x equals 0
3) je  6    ; if (x == 0), goto 6
4) mov x, 0 ; x = 0;
5) jmp 7    ; goto 7;
6) mov x, 1 ; x = 1;
7) return   ; return 0;
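The same control flow can be written back in C with goto, which maps almost one-to-one onto the numbered pseudo-assembly. This is only a sketch: the function name branch_demo and the labels are made up, and it returns x instead of 0 purely so the result of the branch can be inspected.

```c
/* Each comment names the numbered pseudo-assembly line it mirrors. */
int branch_demo(void)
{
    int x = 1;          /* 1) mov x, 1 */
    if (x == 0)         /* 2) cmp x, 0 */
        goto line6;     /* 3) je 6     */
    x = 0;              /* 4) mov x, 0 */
    goto line7;         /* 5) jmp 7    */
line6:
    x = 1;              /* 6) mov x, 1 */
line7:
    return x;           /* 7) return (x is returned only for inspection) */
}
```

Because x starts at 1, the je is not taken, line 4 runs, and the unconditional jmp skips over line 6.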

call

The call instruction saves procedure linking information on the stack and branches to the called procedure specified using the target operand. In other words, call is used to branch into a function. The target operand specifies the address of the first instruction in the called procedure. The operand can be an immediate value, a general-purpose register, or a memory location.

Take the following very simple C code for example. The main function calls the function func.

void func(void){
}

int main(void){
	func();
	return 0;
}

The objdump below shows that a call instruction was emitted.

00000000000005b5 <func>:
 5b5:	90                   	nop
 5b6:	c3                   	ret    

00000000000005b7 <main>:
 5b7:	55                   	push   rbp
 5b8:	48 89 e5             	mov    rbp,rsp
 5bb:	e8 f5 ff ff ff       	call   5b5 <func>
 5c0:	b8 00 00 00 00       	mov    eax,0x0
 5c5:	5d                   	pop    rbp
 5c6:	c3                   	ret
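The bytes e8 f5 ff ff ff in the listing can be decoded by hand: e8 is a near relative call, and the four bytes after it are a little-endian signed displacement measured from the address of the next instruction. The C sketch below reproduces that arithmetic; call_target is a hypothetical helper that handles only this one encoding.

```c
#include <stdint.h>

/* Decodes a near relative call (opcode E8 followed by a little-endian
 * 32-bit displacement). The target is relative to the address of the
 * NEXT instruction, i.e. call_addr + 5. Returns 0 if the opcode is not E8. */
uint64_t call_target(uint64_t call_addr, const uint8_t insn[5])
{
    if (insn[0] != 0xE8)
        return 0;
    uint32_t raw = (uint32_t)insn[1]
                 | (uint32_t)insn[2] << 8
                 | (uint32_t)insn[3] << 16
                 | (uint32_t)insn[4] << 24;
    int32_t rel = (int32_t)raw;           /* 0xfffffff5 is -11 */
    return call_addr + 5 + (int64_t)rel;
}
```

For the call at 0x5bb, the next instruction is at 0x5c0, and 0x5c0 - 11 is 0x5b5, the address of func.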


Last updated 6 years ago
