x86 Exploit Development Pt 1 – Intro to Computer Organization and x86 Instruction Set Architecture Fundamentals

Hey guys! I figured that it would be beneficial to have an entire post dedicated to teaching some fundamentals about Computer Organization and the x86 Instruction Set Architecture, since I will be referencing this particular ISA (instruction set architecture) throughout most of my tutorials on Exploit Development and Reverse Engineering.

This will be updated over time as I find more information that might be useful for someone to know when working on these topics!

Computer Organization Basics – Memory:

It is valuable to know how memory works and how it is organized: bits and bytes, and concepts like little endian and big endian. In our case, x86 uses little endian format. This means that a multi-byte value is stored in memory with its least significant byte at the lowest address (little end first), which looks like reverse order when you read the bytes off in address order.

Reference: this website for more information, or Chapter 1 in this book.

Bits and Bytes:

  • A bit has two values (on or off, 1 or 0).
  • A byte is a sequence of 8 bits.
    • Bits are numbered from right-to-left. Bit 0 is the rightmost and the smallest; bit 7 is leftmost and largest.
    • So, the binary sequence 00001001 is the decimal number 9. 00001001 = (2^3 + 2^0 = 8 + 1 = 9).
    • A byte can hold a value of 0-255
      11111111 = 2^7 + 2^6 + 2^5 + 2^4 + 2^3 + 2^2 + 2^1 + 2^0 = 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1 = 255
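To make the bit arithmetic above concrete, here is a minimal sketch in C. The helper name bits_to_value is made up for this example (it is not a standard function), and it assumes the string lists the most significant bit first, just like the binary sequences above.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper for illustration: convert a string of '0'/'1'
// characters (most significant bit first) into its numeric value.
unsigned int bits_to_value(const char *bits)
{
    assert(bits != NULL);
    unsigned int value = 0;
    // Stop at the first character that is not '0' or '1' (including '\0').
    for (size_t i = 0; bits[i] == '0' || bits[i] == '1'; i++) {
        // Shifting left moves every earlier bit one position higher,
        // then the new bit is placed in position 0.
        value = (value << 1) | (unsigned int)(bits[i] - '0');
    }
    return value;
}
```

Calling bits_to_value("00001001") gives 9 and bits_to_value("11111111") gives 255, matching the sums worked out by hand.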

For example,

Byte Name:    A       B       C       D
Location:     0       1       2       3
Value (hex):  0x12    0x34    0x56    0x78

A is an entire byte (8 bits), 0x12 in hex or 00010010 in binary. If A were to be interpreted as a number, it would be “18” in decimal (There is nothing saying we have to interpret it as a number – it could be an ASCII character or something else entirely).

Understanding Pointers:

Pointers are a key part of programming, especially the C and C++ programming languages. A pointer is a programming language object that references (stores) a memory location of another value located in computer memory. It is up to us to interpret the data at that location.

In C, when you cast a pointer to a certain type (such as a char * or int *), it tells the compiler how to interpret the data at that location. For example, let’s declare

void *p = 0; // p is a pointer to an unknown data type
             // p is a NULL pointer
char *c;     // c is a pointer to a char, usually a single byte

NOTE: We can’t get the data from p because we don’t know its type. It could be pointing at a single number, a letter, the start of a string, or an image; we just don’t know how many bytes to read or how to interpret what’s there.

Now, suppose we write

c = (char *)p;

This statement tells the computer to have c point to the same place as p, and to interpret the data there as a single character (char is typically a single byte).

This example does not depend on the type of computer we have: all computers agree on what a single byte is.

If we have a pointer to a single byte (char *, for example), we can walk through memory, reading off a byte at a time. We can examine any memory location and the endian-ness of a computer won’t matter – every computer will give back the same information.
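Here is a rough sketch of what walking through memory one byte at a time can look like in C. The function name format_bytes and its parameters are invented for this example. It views any block of memory through an unsigned char pointer and writes the bytes out as hex text, lowest address first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

// Hypothetical helper for illustration: write the n bytes starting at
// data into out as space-separated hex, lowest address first.
// Returns how many bytes were actually formatted.
size_t format_bytes(const void *data, size_t n, char *out, size_t out_size)
{
    assert(data != NULL && out != NULL && out_size > 0);
    // View the memory one byte at a time, whatever its real type is.
    const unsigned char *p = (const unsigned char *)data;
    size_t used = 0;
    size_t i;
    out[0] = '\0';
    // Each byte needs at most 3 characters plus the terminating '\0'.
    for (i = 0; i < n && used + 4 <= out_size; i++) {
        // p[i] is the same as *(p + i): the byte i positions past p.
        used += (size_t)snprintf(out + used, out_size - used,
                                 i == 0 ? "%02x" : " %02x",
                                 (unsigned int)p[i]);
    }
    return i;
}
```

Given a 4-byte array holding 0x12, 0x34, 0x56, 0x78, this produces the text "12 34 56 78" on any machine, because reading single bytes in address order does not involve endianness at all.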

A pointer stores the memory address of a variable or other memory location. We can dereference a pointer to read the value stored at the address it holds. For example:

int a = 15;                 // a is an int data type, set to the number 15
int *b;                     // b is a pointer that points to an int
b = &a;                     // set b to point to &a (memory address of a)
printf("%d\n", *b);         // if we print *b (dereference b), we get 15
*b = 10;                    // we can set the dereferenced b to 10,
printf("%d\n", a);          // which sets a to 10, since b points to a
printf("%p\n", (void *)b);  // this would print the memory address of variable a
  • The & operator (‘address of’ operator) before a variable name is used to get the address of a variable.
  • The * operator, when used in a declaration, makes the variable a pointer to the data type provided.
    • If the * operator (‘dereference’ operator) is used on a pointer after declaration, it dereferences the pointer (looks up what is stored at the address held in the variable).
  • For more information and better explanation, see this.
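As one more small illustration of & and * working together, here is a sketch of a swap function. The name swap_ints is made up for this example. The caller passes the addresses of two ints, and the function changes the originals by writing through the pointers.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper for illustration: swap two ints through pointers.
void swap_ints(int *x, int *y)
{
    assert(x != NULL && y != NULL);
    int tmp = *x;   // read the value x points to
    *x = *y;        // overwrite it with the value y points to
    *y = tmp;       // store the saved value where y points
}
```

With int a = 15 and int b = 10, calling swap_ints(&a, &b) leaves a as 10 and b as 15. Passing a and b directly (without &) would not compile cleanly, because the function expects addresses, not values.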

Where Problems Occur:

Problems arise when computers read data types that span multiple bytes rather than a single byte, such as long integers or floating-point numbers.

Endian-ness:

Multi-byte data gets stored in one of two ways on a computer: big end first or little end first.

  • Big endian machine: Stores data big-end first. When looking at multiple bytes, the first byte (lowest address) is the biggest (most significant).
  • Little endian machine (reverse order): Stores data little-end first. When looking at multiple bytes, the first byte (lowest address) is the smallest (least significant).

Again, endian-ness does not matter if you have a single byte. If you have one byte, it’s the only data you read so there’s only one way to interpret it.
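If you want to check which kind of machine you are on, one common trick is to store a known multi-byte value and look at its first byte. The sketch below assumes a made-up function name, is_little_endian, and is just one way to do it.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper for illustration: returns 1 on a little endian
// machine and 0 on a big endian machine.
int is_little_endian(void)
{
    uint32_t probe = 1;          // bytes are 00 00 00 01 in some order
    unsigned char first;
    memcpy(&first, &probe, 1);   // copy the byte at the lowest address
    // If the lowest address holds the 01, the little end came first.
    return first == 1;
}
```

On an x86 machine this should return 1, since x86 is little endian.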

Now suppose we have our 4 bytes (A B C D) stored the same way on a big-endian and little-endian machine. Memory location 0 is A on both machines, memory location 1 is B, etc. We can theoretically set this up by doing the following (this will not actually work on a real system, since a normal program is not allowed to write to address 0):

unsigned char *p = (unsigned char *) 0x0;  // Point p to location 0
*p = 0x12;                                 // Set location 0 to A's value
p = (unsigned char *) 0x1;                 // Point p to location 1
*p = 0x34;                                 // Set location 1 to B's value
...                                        // repeat for C and D

This will give us a contiguous set of memory, address locations of 0 to 3, set to the values stored in the bytes A through D respectively.

Interpreting The Data:

Let’s do an example with multi-byte data. A ‘short int’ is typically a 2-byte (16 bit) number, which can range from 0 to 65535 (if unsigned) or -32768 to +32767 (if signed).

short *s;     // pointer to a short int (2 bytes)
s = 0;        // point to location 0; *s is the value

So, s is a pointer to a short (a 2 byte integer), and is now looking at byte location 0 (which has our previous variable A). What happens when we read the value at *s?

  • Big endian machine: A short is 2 bytes, so I’ll read them off: location s is address 0 (A, or 0x12) and location s + 1 is address 1 (B, or 0x34). Since the first byte is biggest (I’m big-endian!), the number must be 0x1234.
  • Little endian machine: I agree, a short is 2 bytes, and I’ll read them off just the same: location s is 0x12, and location s + 1 is 0x34. But in my world, the first byte is the littlest, so the number must be 0x3412.

Keep in mind that both machines start from location 0 (location s) and read memory going towards higher memory addresses (location 1 or location s+1). There is no confusion about what location 0 and location 1 mean, and there is no confusion that a short is 2 bytes. The problem is that one machine interprets the data as 0x1234, and the other machine interprets it as 0x3412.

Endianness determines the order in which the bytes are combined into a value; the data type determines how many bytes to interpret. We claimed location 0 to be a short int, which is 2 bytes, so only 2 bytes were interpreted.

If we then had a second short pointer that pointed to location 2, we would have a similar issue with location 2 and 3, giving us 0x5678 on the big endian machine and 0x7856 on the little endian machine.

If we had a pointer to a 4-byte integer (a long on 32-bit x86) and set it to location 0, we would have 0x12345678 on the big endian machine, and 0x78563412 on the little endian machine.
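Since we can't really write to address 0, here is a sketch that reproduces the results above using an ordinary array as our "memory". The function names (read_be16, read_le16, read_be32, read_le32) are made up for this example. Each one combines bytes by hand the way a big endian or little endian machine would, so the output is the same no matter what machine you run it on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Big endian view of 2 bytes: the first (lowest address) byte is the big end.
uint16_t read_be16(const unsigned char *p)
{
    assert(p != NULL);
    return (uint16_t)((p[0] << 8) | p[1]);
}

// Little endian view of 2 bytes: the first byte is the little end.
uint16_t read_le16(const unsigned char *p)
{
    assert(p != NULL);
    return (uint16_t)(p[0] | (p[1] << 8));
}

// Big endian view of 4 bytes.
uint32_t read_be32(const unsigned char *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

// Little endian view of 4 bytes.
uint32_t read_le32(const unsigned char *p)
{
    assert(p != NULL);
    return  (uint32_t)p[0]        | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

With unsigned char mem[4] = {0x12, 0x34, 0x56, 0x78} standing in for bytes A through D, read_be16(mem) is 0x1234 and read_le16(mem) is 0x3412. Starting at mem + 2 gives 0x5678 and 0x7856, and the 4-byte versions give 0x12345678 and 0x78563412.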

x86 Instruction Set Architecture:

Reference: x86 Assembly Tutorial

Here is some background on computer architecture that is useful to know when performing Reverse Engineering or doing Exploit Development.

In a processor, data lives in memory, gets moved in and out of small, fast storage locations inside the CPU called ‘registers’, and is sometimes pushed onto a ‘stack’. Each processor architecture has its own set of general purpose registers as well as other registers that are highly specific, but we will only worry about the main set of general purpose registers, specifically those in x86. These registers are as follows:

x86 General Purpose Registers:

Reference: X86_Architecture and this

32 bit registers start with E, which stands for Extended (they extended the 16 bit registers, which follow the same convention; just drop the E or R at the front, i.e., AX, BX, CX, DX, SP, BP, etc).

64 bit registers start with R instead of E, which has no historical significance (RAX, RBX, RCX, RDX, RSI, RDI, etc).

NOTE: 32 bit registers start with an E, 64 bit registers start with an R. Both names refer to the same register, just at different sizes. The prefix R only exists on 64 bit architectures. You can reference any size of a register as long as it is no larger than the size of the architecture.
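One way to picture the different sizes of the same register is to treat the 64 bit register as a single number and keep only its low bits. The sketch below is a model in C, not real register access, and the function names low32 and low16 are made up for this example.

```c
#include <stdint.h>

// Model only: what EAX would hold if RAX held the value rax.
uint32_t low32(uint64_t rax)
{
    return (uint32_t)(rax & 0xFFFFFFFFu);   // keep the low 32 bits
}

// Model only: what AX would hold if RAX held the value rax.
uint16_t low16(uint64_t rax)
{
    return (uint16_t)(rax & 0xFFFFu);       // keep the low 16 bits
}
```

If RAX held 0x1122334455667788, this model gives 0x55667788 for EAX and 0x7788 for AX: the smaller names are windows onto the low end of the same register.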

RAX/EAX (64/32 bit Accumulator register): Used in arithmetic operations and also for return values from function calls in certain calling conventions, such as cdecl (this might be useful to help understand a bit more), a common convention in x86.

RBX/EBX (64/32 bit Base register): Sometimes used as a pointer to data. No specific uses, but is often set to a commonly used value (such as 0) throughout a function to speed up calculations

RCX/ECX (64/32 bit Counter register): Used in shift/rotate instructions and loops as a loop counter (like i in for loops).

RDX/EDX (64/32 bit Data register): Used in arithmetic operations and I/O operations, and generally used for storing short-term variables within a function.

RSI/ESI (RSI is 64bit, ESI is 32 bit Source Index register): Used as a pointer to a source in stream operations (like manipulating strings, which are pointers to char arrays).

RDI/EDI (RDI is 64bit, EDI is 32 bit Destination Index register):  Used as a pointer to a destination in stream operations. (also for manipulating strings, see above register).

x86 Special Purpose Registers:

RBP/EBP (RBP is 64bit, EBP is 32 bit Base Pointer): Also known as the frame pointer, stores a pointer to the base (bottom) of the current function's stack frame (higher memory address).

RSP/ESP (RSP is 64bit, ESP is 32 bit Stack Pointer): Stores a pointer to the top of the stack (top of stack is lower memory address than bottom).

RIP/EIP (RIP is 64bit, EIP is 32 bit Instruction Pointer): Stores the address of the instruction that is next to be executed.

EFLAGS – a 32-bit register used as a collection of bits representing Boolean values to store the results of operations and the state of the processor
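To get a feel for why the top of the stack is at a lower address than the bottom, here is a toy model of a stack in C. Everything in it (the toy_stack struct and the toy_init, toy_push, and toy_pop functions) is invented for this example, and it uses an array index in place of a real stack pointer. On x86, a push moves the stack pointer to a lower address before storing, and a pop reads the value and moves the pointer back up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 16

// Toy model: mem stands in for stack memory, sp for the stack pointer.
struct toy_stack {
    uint32_t mem[STACK_WORDS];
    size_t sp;   // index of the current top; STACK_WORDS means empty
};

// Start with sp at the highest position (the bottom of the stack).
void toy_init(struct toy_stack *s)
{
    assert(s != NULL);
    s->sp = STACK_WORDS;
}

// Push: move sp down (toward lower addresses), then store the value.
void toy_push(struct toy_stack *s, uint32_t value)
{
    assert(s != NULL && s->sp > 0);
    s->sp--;
    s->mem[s->sp] = value;
}

// Pop: read the value at the top, then move sp back up.
uint32_t toy_pop(struct toy_stack *s)
{
    assert(s != NULL && s->sp < STACK_WORDS);
    return s->mem[s->sp++];
}
```

After two pushes, sp is two slots lower than where it started, and the values come back off in reverse order (last in, first out).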

x86 Instruction Layout:

When dealing with x86 instructions in Intel syntax, instructions are usually written in the form:

instruction destination, source

for example:

mov eax, [ecx]

which means: read the value at the memory address held in the register ECX and put it into the register EAX. If something is in brackets (like ECX here), the value inside the brackets is used as a memory address, and the instruction works with the value stored at that address. This format is almost always the case for Intel syntax, unless stated otherwise.
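If you are more comfortable with C than assembly, the brackets behave a lot like dereferencing a pointer. The sketch below models this in C. The function names mov_load and mov_copy are made up, and the second one models mov eax, ecx (no brackets), which I am adding here purely as a contrast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Model of `mov eax, [ecx]`: ecx holds an address, and eax receives
// the 32-bit value stored at that address.
uint32_t mov_load(const uint32_t *ecx)
{
    assert(ecx != NULL);
    uint32_t eax = *ecx;   // the brackets act like the * operator
    return eax;
}

// Model of `mov eax, ecx` (no brackets): eax receives whatever is in
// ecx itself, with no memory access.
uintptr_t mov_copy(uintptr_t ecx)
{
    uintptr_t eax = ecx;
    return eax;
}
```

If a variable v holds 42, mov_load(&v) returns 42 (the value at the address), while mov_copy((uintptr_t)&v) just returns the address itself.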

Will be adding more x86 instructions over time! Just wanted to have this baseline out for now!

Please check back again later for a more up to date version.

Wrapping It Up:

I hope this was a useful tutorial and/or learning opportunity for you guys! If you have any feedback or want to chat, feel free to contact me on one of my many means of communication. Thanks!

@emtuls

My website
Github
Gmail
LinkedIn

If you’re a veteran interested in Cyber Security, consider joining our Slack channel.
