CS 208 s22 — Data Structure Representation: Arrays

1. Array Basics
- 1.1. Array Access
- 1.2. Pointer Arithmetic
2. Arrays and Functions
3. Nested Arrays
4. Multilevel Arrays
5. Warmup
6. Multi-level array exercise
7. Background for Lab 2
8. gdb activity
9. Activity
10. Exercise
11. Practice

1 Array Basics

T A[N] means an array of N elements of type T
contiguously allocated region—how big in terms of N?¹
A is a pointer (T*) to the start of the array

1.1 Array Access

int x[5] = {3, 7, 1, 9, 5};
indexes 0, 1, 2, 3, 4
addresses a, a+4, a+8, a+12, a+16, a+20
- where a is the address of the start and a+20 is the address just past the end of the array

Expression	Type	Value
`x[4]`	`int`	`5`
`x`	`int*`	`a`
`x + 1`	`int*`	`a+4`
`&x[2]`	`int*`	`a+8`
`x[5]`	`int`	?? (whatever's there in memory)
`*(x + 1)`	`int`	`7`
`x + i`	`int*`	`a + 4*i`
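
The rows of the table can be checked against a real array. This is a small sketch; the helper name array_access_demo is made up for illustration, and it uses sizeof(int) rather than assuming 4 bytes.

```c
#include <assert.h>

/* Hypothetical helper: checks rows of the table against a real array and
   returns the distance in bytes between x + i and x (valid for 0 <= i <= 5). */
long array_access_demo(int i) {
    int x[5] = {3, 7, 1, 9, 5};

    assert(x[4] == 5);        /* last valid element */
    assert(*(x + 1) == 7);    /* same as x[1] */
    assert(&x[2] == x + 2);   /* address of element 2 */
    assert(*x == x[0]);       /* x converts to a pointer to element 0 */

    /* x + i moves forward i elements, which is i * sizeof(int) bytes */
    return (long)((char *)(x + i) - (char *)x);
}
```

Calling array_access_demo(3) returns 3 * sizeof(int), which is 12 on x86-64 and matches a + 4*i in the table.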

1.2 Pointer Arithmetic

C allows pointer arithmetic where the result is scaled according to the size of the data type referenced by the pointer
array subscripting is the combination of pointer arithmetic and dereference (e.g., A[i] is equivalent to *(A+i))
int *nums and int nums[] are nearly identical declarations
- subtle differences include initialization, sizeof
an array name is an expression (not a variable) that returns the address of the array
- it looks like a pointer to the first (0th) element
  - *ar same as ar[0], *(ar+2) same as ar[2]
an array name is read-only (no assignment) because it is a label
- cannot use ar = <anything>
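
A short sketch of the array/pointer differences listed above. The names ar, p, and array_vs_pointer_demo are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper contrasting an array with a pointer to its first element.
   Returns the element count computed with sizeof. */
size_t array_vs_pointer_demo(void) {
    int ar[5] = {3, 7, 1, 9, 5};
    int *p = ar;                            /* ar converts to &ar[0] */

    assert(*ar == ar[0]);
    assert(*(ar + 2) == ar[2]);
    assert(p[2] == *(p + 2));               /* subscripting works on pointers too */

    assert(sizeof(ar) == 5 * sizeof(int));  /* sizeof sees the whole array */
    assert(sizeof(p) == sizeof(int *));     /* sizeof sees only the pointer */

    p = p + 1;                              /* fine: p is a variable */
    /* ar = ar + 1; */                      /* would not compile: ar is not assignable */
    assert(*p == 7);

    return sizeof(ar) / sizeof(ar[0]);
}
```

The sizeof(ar) / sizeof(ar[0]) idiom only works where ar is still an array, not after it has been converted to a pointer.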

int get_digit(int z[5], int digit) {
    return z[digit];
}

get_digit:
        movl    (%rdi,%rsi,4), %eax   # z[digit] is a 4-byte int, returned in %eax
        ret

2 Arrays and Functions

an array is passed to a function as a pointer—this means the size gets lost!

int foo(int arr[], unsigned int size) {
    ... arr[size - 1] ...
}

arr is really an int* (%rdi can only fit 8 bytes)
without an explicit size parameter, there is no way to determine the length of the array
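
A sketch of how the size gets lost. The function names below are hypothetical; the parameter is written as int *arr, which means the same thing as int arr[] in a parameter list.

```c
#include <assert.h>
#include <stddef.h>

/* Inside the callee, arr is just a pointer, so sizeof reports the pointer
   size (8 on x86-64) no matter how long the caller's array is. */
size_t size_seen_by_callee(int *arr) {
    return sizeof(arr);
}

/* The caller has to pass the length explicitly. */
int last_element(int *arr, unsigned int size) {
    return arr[size - 1];
}

/* Hypothetical caller: returns the element count it can still compute itself. */
size_t caller_demo(void) {
    int nums[7] = {1, 1, 2, 3, 5, 8, 13};

    assert(sizeof(nums) == 7 * sizeof(int));             /* caller knows the size */
    assert(size_seen_by_callee(nums) == sizeof(int *));  /* callee does not */
    assert(last_element(nums, 7) == 13);

    return sizeof(nums) / sizeof(nums[0]);
}
```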

3 Nested Arrays

T A[R][C]
- 2D array of data type T
- R rows, C columns
- What's the array's total size? R*C*sizeof(T)
single contiguous block of memory
stored in row-major order
- all elements in row 0, followed by all elements in row 1, etc.
- address of row i is A + i*(C * sizeof(T))

int sea[4][5] =
  {{ 9, 8, 1, 9, 5},
   { 9, 8, 1, 0, 5},
   { 9, 8, 1, 0, 3},
   { 9, 8, 1, 1, 5}};

https://docs.google.com/spreadsheets/d/17HGr47X1Q8EqkFmZ4Fv8Y54mO8o-wIPaabQBf_bORZ8/edit?usp=sharing

int* get_sea_zip(int index) {
    return sea[index];
}

get_sea_zip:
        leaq    (%rdi,%rdi,4), %rax  # 5 * index
        leaq    sea(,%rax,4), %rax   # sea + 20 * index
        ret
sea:
        .long   9
        .long   8
        .long   1
        .long   9
        .long   5
        ...

A[i][j] to access an individual element of a nested array
- the address works out to A + i*(C*sizeof(T)) + j*sizeof(T) == A + (i*C + j)*sizeof(T)

int get_sea_digit (int index, int digit) {
    return sea[index][digit];
}

get_sea_digit:
        leaq (%rdi,%rdi,4), %rax  # 5 * index
        addq %rax, %rsi           # 5 * index + digit
        movl sea(,%rsi,4), %eax   # *(sea + 4 * (5 * index + digit))
        ret
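
The row-major address formula can be compared against the addresses the compiler actually uses. This is a sketch: the macros R and C and the helper names are made up here, and the array is a local copy of sea.

```c
#include <assert.h>
#include <stddef.h>

#define R 4
#define C 5

static int sea[R][C] = {
    {9, 8, 1, 9, 5},
    {9, 8, 1, 0, 5},
    {9, 8, 1, 0, 3},
    {9, 8, 1, 1, 5}
};

/* Byte offset of sea[i][j] from the start of sea, measured from real addresses. */
size_t sea_offset(int i, int j) {
    return (size_t)((char *)&sea[i][j] - (char *)sea);
}

/* The same offset computed from the formula (i*C + j) * sizeof(T). */
size_t formula_offset(int i, int j) {
    return (size_t)(i * C + j) * sizeof(int);
}

/* Returns 1 if the two agree for every in-bounds (i, j). */
int offsets_agree(void) {
    assert(sizeof(sea) == R * C * sizeof(int));  /* total size is R*C*sizeof(T) */
    for (int i = 0; i < R; i++)
        for (int j = 0; j < C; j++)
            if (sea_offset(i, j) != formula_offset(i, j))
                return 0;
    return 1;
}
```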

4 Multilevel Arrays

is this multilevel array equivalent to the previous nested array sea?

int sea0[5] = {9, 8, 1, 9, 5};
int sea1[5] = {9, 8, 1, 0, 5};
int sea2[5] = {9, 8, 1, 0, 3};
int sea3[5] = {9, 8, 1, 1, 5};
int *sea_m[4] = {sea0, sea1, sea2, sea3};

it contains the same 20 ints
however, each of the four elements of sea_m is a pointer—none of the elements of sea were pointers
within each of the rows (sea0, sea1, sea2, sea3), the 5 ints are allocated as a contiguous block of memory, but each row could be put anywhere
see the difference visually in the spreadsheet linked above

the C code for get_sea_m_digit is the same as get_sea_digit for 2D arrays

int get_sea_m_digit (int index, int digit) {
    return sea_m[index][digit];
}

but the assembly for accessing an element will be different

get_sea_digit:
        leaq (%rdi,%rdi,4), %rax  # 5 * index
        addq %rax, %rsi           # 5 * index + digit
        movl sea(,%rsi,4), %eax   # *(sea + 4 * (5 * index + digit))
        ret

get_sea_m_digit:
        salq $2, %rsi # rsi = 4*digit
        addq sea_m(,%rdi,8), %rsi # p = sea_m[index] + 4*digit
        movl (%rsi), %eax # return *p
        ret

accessing an element now requires two memory accesses
the benefit of this multilevel structure is that the rows can be different lengths
array access looks the same in C (sea[3][2] and sea_m[3][2]), but the underlying memory accesses differ
- Mem[ sea + 20*index + 4*digit ] vs Mem[ Mem[ sea_m + 8*index ] + 4*digit ]
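
A sketch of a multilevel array whose rows have different lengths, which a nested array cannot express. The arrays row0, row1, row2, and ragged and their contents are invented for this example.

```c
#include <assert.h>

static int row0[5] = {9, 8, 1, 9, 5};
static int row1[2] = {4, 2};           /* hypothetical shorter row */
static int row2[3] = {7, 7, 7};        /* hypothetical */
static int *ragged[3] = {row0, row1, row2};

/* Two loads: first the row pointer, then the int it points at. */
int get_ragged(int index, int digit) {
    int *row = ragged[index];   /* Mem[ragged + 8*index] with 8-byte pointers */
    return row[digit];          /* Mem[row + 4*digit] with 4-byte ints */
}

/* Hypothetical demo: returns the sum of two lookups from different rows. */
int ragged_demo(void) {
    /* ragged itself holds only three pointers, not the ints */
    assert(sizeof(ragged) == 3 * sizeof(int *));
    return get_ragged(0, 4) + get_ragged(1, 1);
}
```

The caller still has to know each row's length separately; nothing in ragged records it.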

5 Warmup

For each of the following array accesses to the nested array sea defined above, determine if it is a valid access and, if so, what value it returns.²

sea[2][5]
sea[4][-1]
sea[0][19]

Which of the following statements is FALSE?³

1. sea[4][-2] is a valid array reference
2. sea[1][1] makes two memory accesses
3. sea[2][1] will always be a higher address than sea[1][2]
4. sea[2] is calculated using only lea

6 Multi-level array exercise

For each of the following array accesses to the multilevel array sea_m defined above, determine if it is a valid access and, if so, what value it returns.⁴

sea_m[2][3]
sea_m[1][5]
sea_m[2][-2]

7 Background for Lab 2

register uses: %rax, %rsp, %rdi, %rsi, %rdx, %rcx, %r8, %r9
starting lab 2
- hacking vs analyzing
- either way, practice the skill of finding the relevant detail and ignoring the rest
basic commands: r(un), b(reak), stepi (si), nexti (ni), c(ontinue), layout asm, layout reg, disas, p(rint), x, finish
blank line at the end of defuse.txt
push/pop/moving %rsp at the beginning and end of functions
sscanf

8 `gdb` activity

Get started with these commands:

wget http://cs.carleton.edu/faculty/awb/cs208/s22/notes/gdb-activity.tar
tar xvf gdb-activity.tar
cd gdb-activity
make

Try running the program with ./gdb-activity. What happens?
Open gdb-activity.c

#include <string.h>
#include <stdlib.h>
#include <stdio.h>

int compare(int a, int b);

int main(int argc, char** argv)
{
    int a, b, n;
    char input[100];
    printf("enter good args: ");
    if (fgets(input, 100, stdin) == NULL) {
        printf("I said good args\n");
    }

    n = sscanf(input, "%d %d", &a, &b);
    if (n == 2 && compare(a,b) == 1) {
        printf("good args!\n");
    }
    else {
        printf("bad args, try harder!\n");
    }
    return 0;
}

Observations:

four functions are called: printf, fgets, sscanf, and compare
- the first three are C library functions (since they aren't defined anywhere in this file, they must come from the library headers pulled in by #include)
- look each library function up in the terminal with man 3 FUNCTION, or consult cplusplus.com/FUNCTION (the latter is often easier to understand)
to get the program to print "good args!", we need fgets to return something other than NULL, have sscanf return 2, and have compare(a, b) return 1
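- fgets will read a line from standard input (stdin) and store it in input, reading at most 99 characters so there is room for the terminating null byte
  - it returns NULL only on an error or on end-of-file before any characters are read, so we probably don't have to worry about that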
- sscanf is a super useful function: it parses a string (the first parameter) according to a format string given by the second parameter
  - %d is the format specifier for an integer, so this sscanf call will parse input as two integers separated by a space
  - the parsed items (i.e., each %d) will be written to the corresponding pointers provided as arguments after the format string
    - so the first integer in input will be stored in a and the second will be stored in b
- compare is actually implemented in raw assembly in gdb-activity.s
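
A sketch of how the sscanf return value behaves with the same "%d %d" format. The helper names parse_pair and parse_pair_demo are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parses two integers and returns how many were stored. */
int parse_pair(const char *input, int *a, int *b) {
    return sscanf(input, "%d %d", a, b);
}

void parse_pair_demo(void) {
    int a = 0, b = 0;

    /* both conversions succeed; the trailing newline from fgets is harmless */
    assert(parse_pair("12 34\n", &a, &b) == 2 && a == 12 && b == 34);

    /* the second %d fails to match, so only one item is stored */
    assert(parse_pair("12 abc", &a, &b) == 1);

    /* nothing matches the first %d */
    assert(parse_pair("abc", &a, &b) == 0);
}
```

This is why main checks n == 2 before trusting a and b.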

Let's use gdb to get a sense for how everything is fitting together. (This tutorial video goes over using gdb if you want to review.)
Running disas compare from within gdb gives

0x00000000004006d7 <+0>:     push   %rbx
0x00000000004006d8 <+1>:     mov    %rdi,%rbx
0x00000000004006db <+4>:     add    $0x5,%rbx
0x00000000004006df <+8>:     add    %rsi,%rbx
0x00000000004006e2 <+11>:    cmp    $0xd0,%rbx
0x00000000004006e9 <+18>:    sete   %al
0x00000000004006ec <+21>:    movzbq %al,%rax
0x00000000004006f0 <+25>:    pop    %rbx
0x00000000004006f1 <+26>:    retq

We can reverse engineer the C code to be something like

int compare(int a, int b) {
    return a + b + 5 == 0xd0; // we add %rdi, %rsi, and 5 together in %rbx and then compare it to $0xd0
                              // sete writes 1 to the given register if the cmp indicates the operands are equal
                              // since this is the return value, and we want compare to return 1, 
                              // we should choose inputs to make these equal
}
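
The arithmetic can be tried out directly. This is a reconstruction from the disassembly, not the original source; the name compare_sketch is made up, and long is used because the assembly works on 64-bit registers. Since 0xd0 is 208, the check passes when a + b is 203.

```c
#include <assert.h>

/* Reconstruction from the disassembly (hypothetical name and types). */
long compare_sketch(long a, long b) {
    return a + 5 + b == 0xd0;   /* 0xd0 == 208, so a + b must equal 203 */
}

/* Hypothetical demo: returns the result for one pair that should succeed. */
long compare_sketch_demo(void) {
    assert(compare_sketch(0, 0) == 0);    /* 5 != 208 */
    return compare_sketch(200, 3);        /* 200 + 3 + 5 == 208 */
}
```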

9 Activity

What effect does this assembly program have on registers and memory given the initial values below?⁵

f:
        movl    $1, (%rdi)
        movl    $1, 4(%rdi)
        movl    $2, %edx
        jmp     .L2
.L3:
        movslq  %edx, %rax
        salq    $2, %rax
        movl    -8(%rdi,%rax), %ecx
        addl    -4(%rdi,%rax), %ecx
        movl    %ecx, (%rdi,%rax)
        addl    $1, %edx
.L2:
        cmpl    %esi, %edx
        jl      .L3
        rep ret
main:
        subq    $32, %rsp
        movl    $7, %esi
        movq    %rsp, %rdi
        call    f
        movl    $0, %eax
        addq    $32, %rsp
        ret

10 Exercise

Given the C code and the register to variable mapping below, see how far you can get filling in the corresponding assembly:⁶

for (long i = 0; i < size; i++) {
    total += arr[i];
}

Register	Use
`%rdi`	`arr`
`%rsi`	`size`
`%rdx`	`i`
`%rax`	`total`

init:
        ________________
        ________________
body:
        ________________
        ________________
test:
        ________________
        ________________

11 Practice

CS:APP practice problems 3.36 (p. 256) and 3.37 (p. 258)

Footnotes:

¹

N * sizeof(T)

²

valid, 9 (start of row 2, then 5 ints forward, landing on the first element of row 3)
valid, 5 (start of the non-existent row 4, then 1 int backwards, thereby referencing the valid element at the end of row 3)
valid, 5 (start of row 0, then 19 ints forward, referencing the final int in the array)

³

Statement 2 (sea[1][1] makes two memory accesses) is the false one. The assembly for sea[1][1] computes the address of that specific element and then makes a single memory access to that address.

⁴

valid, 0
invalid, past the end of row 1 (and rows are not adjacent in memory in this multilevel array)
invalid, past the start of row 2 (and rows are not adjacent in memory in this multilevel array)

⁵

Here's a video walkthrough (on Panopto), along with the initial memory/register diagram, the C code and assembly side by side in godbolt, and the final memory/register diagram.
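
One possible C reading of f, reconstructed from the assembly alone. The names arr, n, i, and f_demo are assumptions, and the godbolt source mentioned above may differ in its details.

```c
#include <assert.h>

/* Reconstruction of f: fills arr[0..n-1] with a Fibonacci-style sequence. */
void f(int *arr, int n) {
    arr[0] = 1;                             /* movl $1, (%rdi)  */
    arr[1] = 1;                             /* movl $1, 4(%rdi) */
    for (int i = 2; i < n; i++) {           /* i in %edx, n in %esi */
        arr[i] = arr[i - 2] + arr[i - 1];   /* -8(%rdi,%rax) plus -4(%rdi,%rax) */
    }
}

/* Mirrors main: a 32-byte buffer and n = 7. Returns the last value written. */
int f_demo(void) {
    int buf[8];                             /* subq $32, %rsp reserves 32 bytes */
    f(buf, 7);
    assert(buf[2] == 2 && buf[3] == 3 && buf[4] == 5 && buf[5] == 8);
    return buf[6];
}
```

With n = 7 the first seven ints of the buffer become 1, 1, 2, 3, 5, 8, 13; the eighth is left untouched.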

⁶

init:
        movl $0, %edx
        jmp  test
body:
        addl (%rdi, %rdx, 4), %eax
        addq $1, %rdx
test:
        cmpq %rsi, %rdx
        jl   body
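
To compare this solution against compiler output, the loop can be wrapped in a function. This is a sketch: the function name sum_ints and the int element type are assumptions, inferred from the addl instruction and the scale factor of 4 in the solution.

```c
#include <assert.h>

/* Hypothetical wrapper around the exercise's loop. */
int sum_ints(int *arr, long size) {
    int total = 0;
    for (long i = 0; i < size; i++) {
        total += arr[i];
    }
    return total;
}

/* Hypothetical demo: sums a small array. */
int sum_ints_demo(void) {
    int nums[4] = {1, 2, 3, 4};
    assert(sum_ints(nums, 0) == 0);   /* the test at the top skips the body */
    return sum_ints(nums, 4);
}
```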