Understanding x86 Instructions
If you are going to be reverse engineering binaries it is vital that you understand what information the disassembler is giving you. Unfortunately for us, we can’t regenerate the source code for any given binary (though Ghidra and IDA get close!). We are instead given the assembly code for a binary and it is up to us to piece together what the binary is doing by reading what the disassembler shows us.
My favorite reference for x86 instructions, whether I’m unfamiliar with one or need to brush up on its allowable operands, is Felix Cloutier’s x86 and amd64 Reference Manual. Much of the information below can be found in that reference manual.
An excellent tool to view assembly online is Matt Godbolt’s Compiler Explorer. Compiler Explorer is an interactive compiler. The left-hand pane shows editable C, C++, Rust, Go, D, Haskell, Swift and Pascal code. The right-hand pane shows the assembly output produced by compiling that code with a given compiler and settings. Compiler Explorer supports MANY different compilers as well.
Terminology
With any new topic, the terminology can be quite confusing the first few times you come across it. To understand the basics here, I will cover the following terms:
Instruction
Destination and Source Operands
Immediate
Register
Memory Location
Intel/AT&T Syntax
Instruction
An x86 instruction is a statement that is executed at runtime. For example, mov, pop, push, or even cmpxchg8b are all instructions. You may hear these referred to as opcodes, but that term is typically used when referring to the hex value of the instruction.
Destination Operand and Source Operands
An x86 instruction can have zero to three operands (or “arguments”). Operands are separated by commas. In AT&T syntax, for operations with two operands, the first (lefthand) operand is the source operand, and the second (righthand) operand is the destination operand. For Intel, the reverse is true.
Intel: mov dest, src -> mov eax, 1
AT&T: mov src, dest -> mov $1, %eax
Immediate
An immediate value (or simply an immediate or imm) is a piece of data that is stored as part of the instruction itself instead of being in a memory location or a register.
Immediate values are typically used in instructions that load a value or perform an arithmetic or a logical operation on a constant.
For example, the instruction push 3 is encoded as the bytes 6A 03. The 3 is an immediate value because it is part of the instruction itself.
If a value refers to memory or registers, it is not immediate.
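To make the idea of "stored as part of the instruction" concrete, here is a minimal C sketch that pulls the immediate back out of the two bytes of push 3. The helper name decode_push_imm8 is hypothetical, and the sketch only handles the one-byte-immediate form of push (opcode 6A); it is not a general decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The two bytes of `push 3`: opcode 0x6A (push imm8), then the immediate. */
static const uint8_t PUSH_3[] = { 0x6A, 0x03 };

/* Hypothetical helper: if `code` starts with a `push imm8` instruction,
 * store its sign-extended immediate in *imm and return 1; otherwise return 0. */
int decode_push_imm8(const uint8_t *code, size_t len, int *imm)
{
    if (len < 2 || code[0] != 0x6A)
        return 0;
    /* The immediate is a signed byte: values 0x80..0xFF are negative. */
    *imm = (code[1] >= 0x80) ? (int)code[1] - 256 : (int)code[1];
    return 1;
}
```

Calling decode_push_imm8(PUSH_3, sizeof PUSH_3, &value) reads the 3 straight out of the instruction bytes; no register or memory location is consulted.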
Register
A register is a storage area inside the CPU. There are general purpose registers (rax, rbx, rcx, rdx), registers which have special usage (for example, the instruction pointer, rip), and various others (segment registers, SSE registers).
The 64-bit versions of the x86 registers are named:
rax - register a extended
rbx - register b extended
rcx - register c extended
rdx - register d extended
rbp - register base pointer (base of the current stack frame)
rsp - register stack pointer (current location in stack, growing downwards)
rsi - register source index (source for data copies)
rdi - register destination index (destination for data copies)
r8-r15 - register 8-15
You can access these registers using the following conventions:
64-bit registers using the r prefix: rax, r15
32-bit registers using the e prefix (e_x) or d suffix (added registers: r__d): eax, r15d
16-bit registers using no prefix (_x) or a w suffix (added registers: r__w): ax, r15w
8-bit registers using h (“high byte” of 16 bits) suffix (bits 8-15: _h): ah, bh
8-bit registers using l (“low byte” of 16 bits) suffix (bits 0-7: _l) or b suffix (added registers: r__b): al, bl, r15b
In summary:
Bit-length | Register
64 | rax
32 | eax
16 | ax
8-high | ah
8-low | al
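The smaller names are views onto the same 64-bit register, not separate storage. A minimal C sketch of that relationship, using masks and shifts on a 64-bit value standing in for rax (the view_* function names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Each function returns the part of a 64-bit "rax" value that the
 * corresponding sub-register name refers to. */
uint32_t view_eax(uint64_t rax) { return (uint32_t)(rax & 0xFFFFFFFFu); } /* bits 0-31 */
uint16_t view_ax(uint64_t rax)  { return (uint16_t)(rax & 0xFFFFu); }     /* bits 0-15 */
uint8_t  view_ah(uint64_t rax)  { return (uint8_t)((rax >> 8) & 0xFFu); } /* bits 8-15 */
uint8_t  view_al(uint64_t rax)  { return (uint8_t)(rax & 0xFFu); }        /* bits 0-7  */
```

With rax holding 0x1122334455667788, these return 0x55667788, 0x7788, 0x77 and 0x88 respectively.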
Memory Location
A memory location is, as the name suggests, the address of some area of memory.
For example, a function that fills a local array first grows the stack, then accesses memory locations both when it stores into the array and when it moves an array element into the rdi register.
In Intel syntax, a memory operand is written in square brackets, for example [rbp-8], whereas in AT&T syntax it is written with parentheses, for example -8(%rbp).
Intel/AT&T Syntax
Intel and AT&T assembly syntax look very different from each other, and this will lead to confusion when you first come across AT&T syntax after having learnt Intel syntax, or vice versa.
In AT&T syntax, registers are prefixed with a percent (%) sign and immediate values with a dollar ($) sign. In Intel syntax, numbers are commonly suffixed with either an h for hex or d for decimal, while in AT&T they are prefixed with 0x for hex and nothing for decimal.
In AT&T syntax, the first operand is the source while the second operand is the destination. In Intel syntax, the first operand is the destination while the second operand is the source.
In Intel syntax, memory operands are written in brackets whereas in AT&T syntax they are enclosed in parentheses. The layout of the address expression also differs: Intel writes [base + index*scale + disp], while AT&T writes disp(base, index, scale).
To switch to Intel syntax in the following tools, use these options:
GDB: set disassembly-flavor intel
GCC: -masm=intel
Objdump: --disassembler-options=intel
Whether you prefer Intel or AT&T is up to you, but just be aware that Intel vs AT&T is a fairly niche Internet “holy war”. Those on the opposing side will vehemently disagree with you.
x86 Instructions
After a lot of boring (but necessary!) terminology, let’s jump into x86 instructions. Below you will find a few very common x86 instructions. Keep in mind that this is a very short list meant to introduce you to the kinds of operations you will likely see when disassembling a binary; it is in no way an exhaustive list. An exhaustive introduction to anything never does anyone any good.
add, sub, mul, and div
All of these basic math operations are fairly simple and you should already understand how they work, but for completeness and for reference I will leave them here.
add
add takes the form add dest, src. This equates to dest = dest + src.
The destination operand can be a register or a memory location; the source operand can be an immediate, a register, or a memory location. However, two memory operands cannot be used in one instruction.
When an immediate value is used as an operand, it is sign-extended to the length of the destination operand format.
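Sign extension is easiest to see with a small negative immediate. The following C sketch models a 32-bit add whose immediate is a single byte, as in the short encoding of add eax, -1, where the immediate byte is FF. The function names are hypothetical and the model ignores the flags that a real add would set.

```c
#include <assert.h>
#include <stdint.h>

/* Widen an 8-bit immediate to 32 bits by copying its sign bit upward. */
uint32_t sign_extend_imm8(uint8_t imm8)
{
    return (imm8 & 0x80u) ? (0xFFFFFF00u | imm8) : imm8;
}

/* Model of `add r32, imm8`: dest = dest + sign_extend(imm8), wrapping at 32 bits. */
uint32_t add_r32_imm8(uint32_t dest, uint8_t imm8)
{
    return dest + sign_extend_imm8(imm8);
}
```

Here the byte 0xFF becomes 0xFFFFFFFF, so adding it to 10 wraps around to 9, which is exactly 10 + (-1). The same widening rule applies to sub.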
sub
sub takes the form sub dest, src. This equates to dest = dest - src.
The destination operand can be a register or a memory location; the source operand can be an immediate, a register, or a memory location. However, two memory operands cannot be used in one instruction.
When an immediate value is used as an operand, it is sign-extended to the length of the destination operand format.
mul
mul takes the form mul src. It performs an unsigned multiplication of an implied operand by src; unlike add and sub, you do not name the destination.
The implied operand is located in register AL, AX, EAX or RAX (depending on the size of the operand); the source operand is located in a general-purpose register or a memory location.
The result is stored in register AX, register pair DX:AX, register pair EDX:EAX, or register pair RDX:RAX (depending on the operand size), with the high-order bits of the product contained in register AH, DX, EDX, or RDX, respectively.
The action of this instruction and the location of the result depend on the opcode and the operand size.
Operand Size | Source 1 | Source 2 | Destination
Byte | AL | r/m8 | AX
Word | AX | r/m16 | DX:AX
Doubleword | EAX | r/m32 | EDX:EAX
Quadword | RAX | r/m64 | RDX:RAX
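As a rough model of the Doubleword row, the C sketch below multiplies a 32-bit "EAX" by a 32-bit source and splits the 64-bit product into its EDX (high) and EAX (low) halves. MulResult and mul32 are hypothetical names, and flags are not modeled.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t eax; /* low 32 bits of the product  */
    uint32_t edx; /* high 32 bits of the product */
} MulResult;

/* Model of `mul r/m32`: EDX:EAX = EAX * src (unsigned). */
MulResult mul32(uint32_t eax, uint32_t src)
{
    uint64_t product = (uint64_t)eax * src;
    MulResult r;
    r.eax = (uint32_t)product;
    r.edx = (uint32_t)(product >> 32);
    return r;
}
```

For small operands EDX simply ends up 0. With EAX = 0xFFFFFFFF and a source of 2, the product 0x1FFFFFFFE no longer fits in 32 bits, so EAX receives 0xFFFFFFFE and EDX receives 1.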
div
div takes the form div src. It performs an unsigned division of an implied dividend by src and produces both a quotient and a remainder.
It divides the unsigned value in the AX, DX:AX, EDX:EAX, or RDX:RAX registers (dividend) by the source operand (divisor) and stores the result in the AX (AH:AL), DX:AX, EDX:EAX, or RDX:RAX registers, with the quotient in the low half and the remainder in the high half. The source operand can be a general-purpose register or a memory location.
Operand Size | Dividend | Divisor | Quotient | Remainder | Max Quotient
Word/Byte | AX | r/m8 | AL | AH | 255
Doubleword/Word | DX:AX | r/m16 | AX | DX | 65,535
Quadword/Doubleword | EDX:EAX | r/m32 | EAX | EDX | 2^32 - 1
Doublequadword/Quadword | RDX:RAX | r/m64 | RAX | RDX | 2^64 - 1
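The Quadword/Doubleword row can be sketched in C as follows. DivResult and div32 are hypothetical names; the fault field stands in for the divide error the CPU raises when the divisor is zero or the quotient exceeds the Max Quotient column, in which case this model leaves the register values unchanged.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t eax; /* quotient  */
    uint32_t edx; /* remainder */
    int fault;    /* 1 if a divide error would be raised */
} DivResult;

/* Model of `div r/m32`: divides the 64-bit value EDX:EAX by divisor. */
DivResult div32(uint32_t edx, uint32_t eax, uint32_t divisor)
{
    DivResult r = { eax, edx, 0 };
    uint64_t dividend = ((uint64_t)edx << 32) | eax;

    /* Division by zero, or a quotient that does not fit in EAX, faults. */
    if (divisor == 0 || dividend / divisor > 0xFFFFFFFFu) {
        r.fault = 1;
        return r;
    }
    r.eax = (uint32_t)(dividend / divisor);
    r.edx = (uint32_t)(dividend % divisor);
    return r;
}
```

Dividing 17 by 5 (EDX = 0, EAX = 17) leaves 3 in EAX and 2 in EDX. This is also why compiled code usually zeroes EDX before a 32-bit div: whatever is left in EDX becomes the top half of the dividend.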
mov
The mov instruction is one of the most basic. It copies a value from the source operand to the destination operand. The source operand can be an immediate value, general-purpose register, segment register, or memory location; the destination operand can be a general-purpose register, segment register, or memory location. Both operands must be the same size, which can be a byte, a word, a doubleword, or a quadword.
For example, the following line moves the value 1 into the eax register: mov eax, 1.
This can typically be done in fewer bytes using push and pop. The code above in push/pop form would be push 1 followed by pop eax.
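The size difference can be checked from the encodings themselves. The C sketch below stores the bytes of mov eax, 1 and of the push 1 / pop eax pair as arrays and compares their lengths. The array and function names are hypothetical; the byte values are the standard 32-bit encodings, and the saving only applies when the constant fits in a signed byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* mov eax, 1  ->  B8 followed by a 4-byte immediate */
static const uint8_t MOV_EAX_1[] = { 0xB8, 0x01, 0x00, 0x00, 0x00 };

/* push 1 ; pop eax  ->  6A 01 (push imm8), 58 (pop eax) */
static const uint8_t PUSH_POP_EAX_1[] = { 0x6A, 0x01, 0x58 };

size_t mov_form_size(void)      { return sizeof MOV_EAX_1; }
size_t push_pop_form_size(void) { return sizeof PUSH_POP_EAX_1; }
```

The mov form takes 5 bytes and the push/pop form takes 3, which is why the latter shows up in size-constrained code such as shellcode.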
push and pop
Push and Pop are stack-related instructions. We will discuss the concept of push and pop operations on a stack, but we won’t discuss the call stack or stack pointer until stack-based buffer overflows. If you’re looking for a more in-depth analysis on a program’s stack I would recommend visiting this section.
push
Push decrements the stack pointer and then stores the source operand on the stack. A push instruction looks like push eax or push 0x1. It only takes a single operand.
The first rectangle illustrates a stack containing {1} with a second value (2) in the process of being pushed on the stack. The second rectangle shows the resulting stack, after the push operation, containing {1, 2}.
pop
Conversely, pop loads the value from the top of the stack to the location specified with the destination operand (or explicit opcode) and then increments the stack pointer.
The first rectangle illustrates a stack containing {1, 2}. The second rectangle shows the result of a stack containing {1, 2} after a pop operation was performed on it. The 2 was taken off of the stack and only {1} remains.
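The two operations can be modeled in a few lines of C. This is a simplified sketch, not how the hardware stack is laid out: the Stack type, its fixed size, and the function names are all hypothetical, and the stack pointer is a slot index rather than a byte address. It does keep the two properties described above: push decrements the stack pointer before storing, and pop loads before incrementing.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_SLOTS 16

typedef struct {
    uint32_t slots[STACK_SLOTS];
    unsigned sp; /* index of the top element; STACK_SLOTS means empty */
} Stack;

void stack_init(Stack *s) { s->sp = STACK_SLOTS; }

/* push: decrement the stack pointer, then store. Returns 0 if full. */
int stack_push(Stack *s, uint32_t value)
{
    if (s->sp == 0)
        return 0;
    s->sp -= 1;
    s->slots[s->sp] = value;
    return 1;
}

/* pop: load from the top, then increment the stack pointer. Returns 0 if empty. */
int stack_pop(Stack *s, uint32_t *dest)
{
    if (s->sp == STACK_SLOTS)
        return 0;
    *dest = s->slots[s->sp];
    s->sp += 1;
    return 1;
}
```

Pushing 1 and then 2 and popping once yields 2 and leaves only the 1 on the stack, matching the {1, 2} to {1} illustration.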
lea
The lea instruction often eludes a lot of beginner reverse engineers. It stands for Load Effective Address. It’s typically used, as the name suggests, to move an address into the destination operand. The source operand is a memory address (offset part) specified with one of the processor’s addressing modes; the destination operand is a general-purpose register.
For example, suppose EBX holds 0x4008e0 and the doubleword stored at address 0x4008e4 is 0x00211000. The instruction lea eax, [ebx + 4] would result in EAX containing 0x4008e4, the computed address itself. By contrast, a similar mov instruction mov eax, [ebx + 4] would result in EAX containing 0x00211000, the value read from that address.
LEA may also be used for generic calculations: lea eax, [ eax + ebx + 1234567 ] calculates EAX + EBX + 1234567.
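If you think in C, the lea/mov distinction is roughly the difference between taking an address and dereferencing it. The sketch below is an analogy rather than a translation of any particular disassembly; all three function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Like `lea eax, [ebx + 4]`: compute the address, never touch memory. */
const uint32_t *lea_like(const uint32_t *ebx) { return &ebx[1]; }

/* Like `mov eax, [ebx + 4]`: compute the address, then load from it. */
uint32_t mov_like(const uint32_t *ebx) { return ebx[1]; }

/* Like `lea eax, [eax + ebx + 1234567]`: plain arithmetic, no memory access. */
uint32_t lea_arith(uint32_t eax, uint32_t ebx) { return eax + ebx + 1234567u; }
```

Because each uint32_t is 4 bytes, &ebx[1] is the address 4 bytes past ebx, while ebx[1] is whatever is stored there.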
For more information on LEA, go to this very detailed, well-written StackOverflow post.
jmp
jmp is very easy to understand if you’ve ever written a basic program before. jmp instructions are created whenever your program creates a branch. These can be optimized out by the compiler occasionally. The jmp instruction changes what is called the control flow of a program. If you read through the Exploit Development Toolchain section, you might recognize that term. The control flow of a program describes which instructions get executed, and in what order.
There are multiple “forms” of jump instructions depending on the condition you want to jump for. There are quite a few (jmp, je, jne, jg, etc.) and they are listed here. jmp and friends take a single operand - the address to jump to.
Let’s take a look at the following C code:
When translated into pseudo-assembly we should be expecting something similar to:
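As a stand-in illustration of that translation, here is the same small branch written twice in C: once structured, and once with goto so that each line maps onto a compare, a conditional jump, or an unconditional jmp. The function names are hypothetical, the instruction names in the comments are only the shape one might expect, and a real compiler may emit something quite different (or remove the branch entirely).

```c
#include <assert.h>

/* Structured form: what you would normally write. */
int max_structured(int a, int b)
{
    if (a > b)
        return a;
    else
        return b;
}

/* Same logic with explicit jumps, mirroring the assembly shape. */
int max_with_jumps(int a, int b)
{
    int result;
    if (a > b) goto take_a; /* cmp a, b ; jg take_a (conditional jump) */
    result = b;             /* fall-through path */
    goto done;              /* jmp done (unconditional jump) */
take_a:
    result = a;
done:
    return result;
}
```

Both functions return the larger argument; the second just makes the jumps that the first one hides visible.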
call
The call instruction saves procedure linking information on the stack and branches to the called procedure specified using the target operand. In other words, call is used to branch into a function. The target operand specifies the address of the first instruction in the called procedure. The operand can be an immediate value, a general-purpose register, or a memory location.
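For an ordinary near call, the "procedure linking information" is the return address: the address of the instruction right after the call. A toy C model of that behaviour, together with the matching ret, might look like the following. The Cpu type and function names are hypothetical, the addresses are made up for illustration, and far calls and privilege changes are not modeled.

```c
#include <assert.h>
#include <stdint.h>

#define CPU_STACK_SLOTS 8

typedef struct {
    uint32_t stack[CPU_STACK_SLOTS];
    unsigned sp; /* index of the top slot; CPU_STACK_SLOTS means empty */
    uint32_t ip; /* address of the instruction being executed */
} Cpu;

void cpu_init(Cpu *c, uint32_t ip) { c->sp = CPU_STACK_SLOTS; c->ip = ip; }

/* call: push the address of the next instruction, then jump to target. */
int do_call(Cpu *c, uint32_t target, uint32_t insn_len)
{
    if (c->sp == 0)
        return 0;
    c->sp -= 1;
    c->stack[c->sp] = c->ip + insn_len; /* return address */
    c->ip = target;
    return 1;
}

/* ret: pop the saved return address back into the instruction pointer. */
int do_ret(Cpu *c)
{
    if (c->sp == CPU_STACK_SLOTS)
        return 0;
    c->ip = c->stack[c->sp];
    c->sp += 1;
    return 1;
}
```

If a 5-byte call at address 0x1000 targets 0x2000, the model pushes 0x1005 and continues at 0x2000; the later ret resumes execution at 0x1005.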
Take the following very simple C code for example. The main function calls the function func.
The objdump below shows that a call instruction was emitted.