Hey guys! I figured that it would be beneficial to have an entire post dedicated to teaching some fundamentals about Computer Organization and the x86 Instruction Set Architecture, since I will be referencing this particular ISA (instruction set architecture) throughout most of my tutorials on Exploit Development and Reverse Engineering.
This will be updated over time as I find more information that might be useful for someone to know when working on these topics!
Computer Organization Basics – Memory:
It would be valuable to know how memory works and how it is organized: bits and bytes, and concepts like little endian and big endian. In our case, x86 uses little endian format. What this means is that a multi-byte value is stored in memory with its least significant byte at the lowest address (little end first), so the bytes appear in reverse order compared to how we would write the number.
Bits and Bytes:
- A bit has two values (on or off, 1 or 0).
- A byte is a sequence of 8 bits.
- Bits are numbered from right-to-left. Bit 0 is the rightmost and the smallest; bit 7 is leftmost and largest.
- So, the binary sequence 00001001 is the decimal number 9: 00001001 = 2^3 + 2^0 = 8 + 1 = 9.
- A byte can hold a value of 0-255:
11111111 = 2^7 + 2^6 + 2^5 + 2^4 + 2^3 + 2^2 + 2^1 + 2^0 = 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1 = 255
Byte Name:    A     B     C     D
Location:     0     1     2     3
Value (hex):  0x12  0x34  0x56  0x78
A is an entire byte (8 bits), 0x12 in hex or 00010010 in binary. If A were to be interpreted as a number, it would be “18” in decimal (There is nothing saying we have to interpret it as a number – it could be an ASCII character or something else entirely).
Pointers are a key part of programming, especially the C and C++ programming languages. A pointer is a programming language object that references (stores) a memory location of another value located in computer memory. It is up to us to interpret the data at that location.
In C, when you cast a pointer to a certain type (such as a char * or int *), it tells the computer how to interpret the data at that location. For example, let's declare:
void *p = 0;   // p is a pointer to an unknown data type; here it is a NULL pointer
char *c;       // c is a pointer to a char, which is a single byte
NOTE: We can't get the data from p because we don't know its type. It could be pointing at a single number, a letter, the start of a string, or an image; we just don't know how many bytes to read, or how to interpret what's there.
Now, suppose we write
c = (char *)p;
This statement tells the computer to have c point to the same place as p, and to interpret the data there as a single character (a char is a single byte).
This example does not depend on the type of computer we have; all computers agree on what a single byte is.
If we have a pointer to a single byte (char *, for example), we can walk through memory, reading off a byte at a time. We can examine any memory location and the endian-ness of a computer won’t matter – every computer will give back the same information.
A pointer stores the memory address of a variable or other memory location. We can dereference a pointer to read the value stored at the address it holds. For example:
int a = 15;                 // a is an int, set to the number 15
int *b;                     // b is a pointer that points to an int
b = &a;                     // set b to point to &a (memory address of a)
printf("%d\n", *b);         // if we print *b (dereference b), we get 15
*b = 10;                    // we can set the dereferenced b to 10,
printf("%d\n", a);          // which sets a to 10, since b points to a
printf("%p\n", (void *)b);  // this prints the memory address of variable a
- The & operator (‘address of’ operator) before a variable name is used to get the address of a variable.
- The * symbol, when used in a variable declaration, makes the variable a pointer to the data type provided.
- If the * operator ('dereference' operator) is used on a pointer after declaration, it dereferences the address (looks up what is at the address) that is stored in the pointer.
- For more information and better explanation, see this.
Where Problems Occur:
Problems arise when computers read data types that span multiple bytes rather than a single byte, such as long integers or floating-point numbers.
Multi-byte data gets stored in one of two ways on a computer: big end first or little end first.
- Big endian machine: Stores data big-end first. When looking at multiple bytes, the first byte (lowest address) is the biggest, i.e. the most significant.
- Little endian machine (reverse order): Stores data little-end first. When looking at multiple bytes, the first byte (lowest address) is the smallest, i.e. the least significant.
Again, endian-ness does not matter if you have a single byte. If you have one byte, it’s the only data you read so there’s only one way to interpret it.
Now suppose we have our 4 bytes (A B C D) stored the same way on a big-endian and little-endian machine. Memory location 0 is A on both machines, memory location 1 is B, etc. We can theoretically set this up by doing the following (will not actually work):
unsigned char *p = (unsigned char *)0x0; // point p to location 0 (a byte pointer, since a void * cannot be dereferenced)
*p = 0x12;                               // set location 0 to A's value
p = (unsigned char *)0x1;                // point p to location 1
*p = 0x34;                               // set location 1 to B's value
...                                      // repeat for C and D
This will give us a contiguous set of memory, address locations of 0 to 3, set to the values stored in the bytes A through D respectively.
Interpreting The Data:
Let's do an example with multi-byte data. A 'short int' is a 2-byte (16-bit) number, which can range from 0 to 65535 (if unsigned) or -32768 to +32767 (if signed).
short *s;   // pointer to a short int (2 bytes)
s = 0;      // point to location 0; *s is the value
So, s is a pointer to a short (a 2-byte integer), and is now looking at byte location 0 (which holds our byte A). What happens when we read the value at s?
- Big endian machine: A short is 2 bytes, so I’ll read them off: location s is address 0 (A, or 0x12) and location s + 1 is address 1 (B, or 0x34). Since the first byte is biggest (I’m big-endian!), the number must be 0x1234.
- Little endian machine: I agree, a short is 2 bytes, and I’ll read them off just the same: location s is 0x12, and location s + 1 is 0x34. But in my world, the first byte is the littlest, so the number must be 0x3412.
Keep in mind that both machines start from location 0 (location s) and read memory going towards higher memory addresses (location 1 or location s+1). There is no confusion about what location 0 and location 1 mean, and there is no confusion that a short is 2 bytes. But the problem is, one machine interprets the data as 0x1234, and the other machine interprets it as 0x3412.
Endianness determines the order in which the bytes are combined into a value, while the data type determines how many bytes to interpret. We claimed location 0 to be a short int, which is 2 bytes, so only 2 bytes were interpreted.
If we then had a second short pointer that pointed to location 2, we would have a similar issue with location 2 and 3, giving us 0x5678 on the big endian machine and 0x7856 on the little endian machine.
If we had a pointer to a 4-byte integer (a long on 32-bit x86) and set it to location 0, we would have 0x12345678 on the big endian machine, and 0x78563412 on the little endian machine.
x86 Instruction Set Architecture:
Reference: x86 Assembly Tutorial
Here is some background on computer architecture that is useful to know when performing Reverse Engineering or doing Exploit Development.
A processor works on data held in small, fast storage locations called 'registers', and it can also push values onto a region of memory called the 'stack'. Each processor architecture has its own set of general purpose registers as well as special registers that are highly specific, but we will only worry about the main set of general purpose registers, specifically those in x86. These registers are as follows:
x86 General Purpose Registers:
32-bit registers start with E, which stands for Extended: they extend the 16-bit registers, which follow the same naming convention with the E or R at the front dropped (AX, BX, CX, DX, SP, BP, etc.).
64-bit registers start with R instead of E, which has no historical significance (RAX, RBX, RCX, RDX, RSI, RDI, etc.).
NOTE: The E and R names refer to the same register at different sizes: EAX is the lower 32 bits of RAX, and AX is the lower 16 bits of EAX. The R prefix only exists on 64-bit architectures. You can reference any size of a register as long as it is no larger than the size of the architecture.
RAX/EAX (64/32 bit Accumulator register): Used in arithmetic operations and also for return values from function calls in certain calling conventions, such as cdecl (this might be useful to help understand a bit more), a common convention in x86.
RBX/EBX (64/32 bit Base register): Sometimes used as a pointer to data. No specific uses, but is often set to a commonly used value (such as 0) throughout a function to speed up calculations
RCX/ECX (64/32 bit Counter register): Used in shift/rotate instructions and loops as a loop counter (like i in for loops).
RDI/EDI (RDI is 64bit, EDI is 32 bit Destination Index register): Used as a pointer to a destination in stream operations. (also for manipulating strings, see above register).
x86 Special Purpose Registers:
RIP/EIP (RIP is 64bit, EIP is 32 bit Instruction Pointer): Stores the address of the next instruction to be executed.
EFLAGS – a 32-bit register used as a collection of bits representing Boolean values to store the results of operations and the state of the processor
x86 Instruction Layout:
When dealing with x86 instructions in Intel syntax, instructions are usually written in the form:
instruction destination, source
mov eax, [ecx]
The instruction above, mov eax, [ecx], means: read the value at the memory address held in the register ECX and put it into the register EAX. When an operand is in brackets (as ECX is here), the value inside the brackets is used as a memory address, and the instruction operates on the data stored at that address. This destination-first operand order is almost always the case for Intel syntax, unless stated otherwise.
Will be adding more x86 instructions over time! Just wanted to have this baseline out for now!
Please check back again later for a more up to date version.
Wrapping It Up:
I hope this was a useful tutorial and/or learning opportunity for you guys! If you have any feedback or want to chat, feel free to contact me on one of my many means of communication. Thanks!
If you’re a veteran interested in Cyber Security, consider joining our Slack channel.