CS 208 s21 — Bits & Byte Ordering
1 Bits in Memory
One of the key reasons we use the C programming language in this course is that it lets the programmer get close to memory (and we're interested in what's going on in memory). This means that, unlike in Java or Python, the operations you do in C correspond closely to the underlying arrangement of data in memory. This has some benefits and some serious drawbacks, which I'll discuss in a bit. Before we get to C programming itself, we'll start by going through what memory is and how it's arranged.
1.1 Memory is a Long Array of Bytes
- we will conceive of memory as one giant array of cells or locations, each of which can store one byte (8 bits) of data
- the index or location of a byte in memory is called its memory address
- a memory address, like an array index, is just a number identifying a position within the array of bytes
- where do these addresses come from? how big can they be (i.e., how long is our array of bytes)?
- the organization of a system's memory is based on the word size
- among other things, the word size defines the number of bytes in a memory address
- all memory addresses will be this same fixed length
- the size of a memory address determines the number of possible addresses
- the address space is the amount of memory that can be given addresses
- EXAMPLE: let's bring this together by considering a very small memory
- in this example, memory addresses (and the word size) are 3 bits
- 3 bits give 8 unique combinations, and thus 8 unique memory addresses (i.e., locations for a byte in memory)
- so we say that the address space for this memory is 8 bytes—we have enough unique addresses to refer to a total of 8 different locations/bytes
- modern systems use 64-bit (8-byte) words, for a potential address space of 2^64 bytes, which is roughly 18 billion billion bytes (about 18 exabytes)
1.2 Data Sizes
- not all data consists of just a single byte
- for example, a C int is 32 bits/4 bytes wide
- what address should we use?
- any chunk of memory is addressed using the first byte (i.e., lowest memory address)
- just like we saw that the bit pattern 0x4e6f21 had many different interpretations, the interpretation of data at a memory address depends on both the address and the size
C type | bytes (32-bit) | bytes (64-bit) |
---|---|---|
bool | 1 | 1 |
char | 1 | 1 |
short | 2 | 2 |
int | 4 | 4 |
long | 4 | 8 |
char * | 4 | 8 |
int * | 4 | 8 |
float | 4 | 4 |
double | 8 | 8 |
- so a char stored at memory address 0x100 would consist of just the single byte at 0x100
2 Referring to Bytes
We have some terminology for referring to bytes or bits within a particular piece of data. Specifically, we will refer to the bit in the 1s place as the least significant bit. If we were to write out a binary number, the least significant bit would be the right-most digit. Similarly, we will refer to the bit in the largest place as the most significant bit. This would be the left-most digit in a written-out binary number.
We will use this same terminology to refer to bytes within a multi-byte quantity like an int.
For example, in the int 0x0002459f, 0x00 is the most significant byte, while 0x9f is the least significant.
We would say 0x0002 are the two most significant bytes and 0x459f are the two least significant.
3 Byte Order
- ok, so we know an int at 0x100 consists of the four bytes at 0x100, 0x101, 0x102, and 0x103
- let's say our int has the value 0x2ab6600b: which byte goes at which address?
- this depends on the system we're working on (the system's endianness)
- if the system stores the least significant byte first, we say it is little endian
- used by x86 (i.e., Intel machines), Android, iOS
address | byte |
---|---|
0x103 | 2a |
0x102 | b6 |
0x101 | 60 |
0x100 | 0b |
- if the system stores the most significant byte first, we say it is big endian
- used by networks, Oracle, IBM
address | byte |
---|---|
0x103 | 0b |
0x102 | 60 |
0x101 | b6 |
0x100 | 2a |
- little endian vs big endian is simply a convention or preference, no empirical reason to prefer one over the other
- why the difference then? In the early days of computing, different system designers made different decisions
- it would be nice if everyone could get together and pick one, but it turns out coordination is hard
- programmer doesn't need to care in most cases
- it matters when data is being sent between machines with different endianness
- will need to keep little-endian ordering in mind once we start working with x86-64 assembly