CS 208 s20 — Data Structure Representation: Arrays

1. Video
2. Warmup
3. Compilation
- 3.1. Producing Machine Language
4. Array Basics
5. Arrays and Functions
6. Nested Arrays
7. Homework

1 Video

Here is a video lecture for the material outlined below. It covers CSPP section 3.8 (p. 255–265). It contains sections on

the Sora (0:00)
compilation overview (0:38)
linking (4:52
warmup exercise (11:15)
array basics (13:03)
example array expressions (17:11)
array access in assembly (28:39)
for loop/array access exercise (31:15)
arrays and functions (33:26)
2D arrays (42:00)
end notes (52:17)

The Panopto viewer has table of contents entries for these sections. Link to the Panopto viewer: https://carleton.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=ca2b2f65-5029-4466-810c-abb80107011c

2 Warmup

If %rsp is 0x100 and the current stack frame consists of a local array of 4 ints, where is the return address for the current function call stored?¹

3 Compilation

We will start by zooming out a little from assembly to look briefly at the whole compilation process
Let's say out code is in two files: p1.c and p2.c
We compile our program with the command: gcc -Og p1.c p2.c -o p
- this produces a binary file p containing the resulting machine code
- we can run the program with the command: ./p

compilation consists of 3 primary phases:
1. translating source code (e.g., C code) to assembly (still in human-readable text form at this point)
2. translating assembly to binary object files (each source file is assembled separately)
3. linking object files (and any standard library files) together into the final executable

3.1 Producing Machine Language

how do we go from C code to machine code? How do we get the information we need?
simple cases: arithmetic and logical operations, shifts, etc.
- all necessary information is contained in the instruction itself (source, destination, size)
but consider
- conditional jump
- accessing static data
- call
all of these depend on addresses and/or labels
- these are a problem because we don't have the final executable yet (i.e., we don't know where these will be located in the final executable)
this is one purpose of the linking step: connect everything together and fill in these "holes" with the appropriate values

4 Array Basics

T A[N] means an array of N elements of type T
contiguously allocated region—how big in terms of N?²
A is a pointer (T*) to the start of the array

4.1 Array Access

int x[5] = {3, 7, 1, 9, 5};
indexes 0, 1, 2, 3, 4
addresses a, a+4, a+8, a+12, a+16, a+20
- where a is the address of the start and a+20 is the address of the end of the array

Expression	Type	Value
`x[4]`	`int`	`5`
`x`	`int*`	`a`
`x + 1`	`int*`	`a+4`
`&x[2]`	`int*`	`a+8`
`x[5]`	`int`	?? (whatever's there in memory)
`*(x + 1)`	`int`	`7`
`x + i`	`int*`	`a + 4*i`

4.2 Pointer Arithmetic

C allows pointer arithmetic where the result is scaled according to the size of the data type referenced by the pointer
array subscripting is the combination of pointer arithmetic and dereference (e.g., A[i] is equivalent to *(A+i))
int *nums and int nums[] are nearly identical declarations
- subtle differences include initialization, sizeof
an array name is an expression (not a variable) that returns the address of the array
- it looks like a pointer to the first (0th) element
  - *ar same as ar[0], *(ar+2) same as ar[2]
an array name is read-only (no assignment) because it is a label
- cannot use ar = <anything>

int get_digit(int z[5], int digit) {
    return z[digit];
}

get_digit:
        movq    (%rdi,%rsi,4), %rax
        ret

4.3 Exercise

Given the C code and the register to variable mapping below, see how far you can get filling in the corresponding assembly:³

for (long i = 0; i < size; i++) {
    total += arr[i];
}

Register	Use
`%rdi`	`arr`
`%rsi`	`size`
`%rdx`	`i`
`%rax`	`total`

init:
        ________________
        ________________
body:
        ________________
        ________________
test:
        ________________
        ________________

5 Arrays and Functions

arrays declared as local variables are allocated in the current stack frame

char* foo() {
    char string[32]; ...;
    return string;
}

void bar() {
    int nums[6];
    ...
}

int main() {
    char *str = foo(); // after foo returns, its stack frame (containing str) is popped off
                       // the string is still there for now...
    bar(); // stack-allocated nums will use the same memory as string from foo
    ... using str ... // program can still use str (refers to valid memory address), but its
                      // contents have been reused in the meantime
}

the above code is broken, future function calls will overwrite string (draw stack frame)
an array is passed to a function as a pointer—this means the size gets lost!

int foo(int arr[], unsigned int size) {
    ... arr[size - 1] ...
}

arr is really an int* (%rdi can only fit 8 bytes)
without an explicite size parameter, no way to determine the length of the array

6 Nested Arrays

T A[R][C]
- 2D array to data type T
- R rows, C columns
- What's the array's total size? R*C*sizeof(T)
single contigious block of memory
stored in row-major order
- all elements in row 0, followed by all elements in row 1, etc.
- address of row i is A + i*(C * sizeof(T))

int sea[4][5] = 
  {{ 9, 8, 1, 9, 5},
   { 9, 8, 1, 0, 5},
   { 9, 8, 1, 0, 3},
   { 9, 8, 1, 1, 5}}

https://docs.google.com/spreadsheets/d/17HGr47X1Q8EqkFmZ4Fv8Y54mO8o-wIPaabQBf_bORZ8/edit?usp=sharing

int* get_sea_zip(int index) {
    return sea[index];
}

get_sea_zip:
        leaq    (%rdi,%rdi,4), %rax  # 5 * index
        leaq    sea(,%rax,4), %rax   # sea + 20 * index
        ret
sea:
        .long   9
        .long   8
        .long   1
        .long   9
        .long   5
        ...

A[i][j] to access an individual element of a nested array
- the address works out to A + i*(C*sizeof(T)) + j*sizeof(T) == A + (i*C + j)*sizeof(T)

int get_sea_digit (int index, int digit) {
    return sea[index][digit];
}

get_sea_digit:
        leaq (%rdi,%rdi,4), %rax  # 5 * index
        addl %rax, %rsi           # 5 * index + digit
        movl sea(,%rsi,4), %eax   # *(sea + 4 * (5 * index + digit))

7 Homework

CSPP practice problems 3.36 (p. 256) and 3.37 (p. 258)
Keep working on Lab 2! The handin script has proved too unstable, so we'll do submissions via Moodle from here on out. For lab 2, upload your solutions.txt to the lab 2 assignment on Moodle.

Footnotes:

The return address is located at 0x110 (16 bytes above the current %rsp, as an array of 4 ints takes up 16 bytes and the return address is stored immediately above the stack frame).

N * sizeof(T)

init:
        movl $0, %edx
        jmp  test
body:
        addl (%rdi, %rdx, 4), %eax
        addq $1, %rdx
test:
        cmpq %rsi, %rdx
        jl   body