CS 208 s22 — Data Structure Representation: Arrays

1 Array Basics

  • T A[N] means an array of N elements of type T
  • contiguously allocated region; how big in terms of N? [1]
  • A is a pointer (T*) to the start of the array

array-types.png

1.1 Array Access

  • int x[5] = {3, 7, 1, 9, 5};
  • indexes 0, 1, 2, 3, 4
  • addresses a, a+4, a+8, a+12, a+16, a+20
    • where a is the address of the start and a+20 is the address of the end of the array
Expression   Type   Value
x[4]         int    5
x            int*   a
x + 1        int*   a+4
&x[2]        int*   a+8
x[5]         int    ?? (whatever's there in memory)
*(x + 1)     int    7
x + i        int*   a + 4*i

1.2 Pointer Arithmetic

  • C allows pointer arithmetic where the result is scaled according to the size of the data type referenced by the pointer
  • array subscripting is the combination of pointer arithmetic and dereference (e.g., A[i] is equivalent to *(A+i))
  • int *nums and int nums[] are nearly identical declarations
    • subtle differences include initialization, sizeof
  • an array name is an expression (not a variable) that returns the address of the array
    • it looks like a pointer to the first (0th) element
      • *ar same as ar[0], *(ar+2) same as ar[2]
  • an array name is read-only (no assignment) because it is a label
    • cannot use ar = <anything>
int get_digit(int z[5], int digit) {
    return z[digit];
}
get_digit:
        movl    (%rdi,%rsi,4), %eax  # z[digit]: 4-byte int at z + 4*digit
        ret

2 Arrays and Functions

  • an array is passed to a function as a pointer—this means the size gets lost!
int foo(int arr[], unsigned int size) {
    ... arr[size - 1] ...
}
  • arr is really an int* (%rdi can only fit 8 bytes)
  • without an explicit size parameter, no way to determine the length of the array

3 Nested Arrays

  • T A[R][C]
    • 2D array of data type T
    • R rows, C columns
    • What's the array's total size? R*C*sizeof(T)
  • single contiguous block of memory
  • stored in row-major order
    • all elements in row 0, followed by all elements in row 1, etc.
    • address of row i is A + i*(C * sizeof(T))
int sea[4][5] =
  {{ 9, 8, 1, 9, 5},
   { 9, 8, 1, 0, 5},
   { 9, 8, 1, 0, 3},
   { 9, 8, 1, 1, 5}};

https://docs.google.com/spreadsheets/d/17HGr47X1Q8EqkFmZ4Fv8Y54mO8o-wIPaabQBf_bORZ8/edit?usp=sharing

int* get_sea_zip(int index) {
    return sea[index];
}
get_sea_zip:
        leaq    (%rdi,%rdi,4), %rax  # 5 * index
        leaq    sea(,%rax,4), %rax   # sea + 20 * index
        ret
sea:
        .long   9
        .long   8
        .long   1
        .long   9
        .long   5
        ...
  • A[i][j] to access an individual element of a nested array
    • the address works out to A + i*(C*sizeof(T)) + j*sizeof(T) == A + (i*C + j)*sizeof(T)
int get_sea_digit (int index, int digit) {
    return sea[index][digit];
}
get_sea_digit:
        leaq (%rdi,%rdi,4), %rax  # 5 * index
        addq %rax, %rsi           # 5 * index + digit
        movl sea(,%rsi,4), %eax   # *(sea + 4 * (5 * index + digit))
        ret

4 Multilevel Arrays

  • is this multilevel array equivalent to the previous sea?
int sea0[5] = {9, 8, 1, 9, 5};
int sea1[5] = {9, 8, 1, 0, 5};
int sea2[5] = {9, 8, 1, 0, 3};
int sea3[5] = {9, 8, 1, 1, 5};
int *sea_m[4] = {sea0, sea1, sea2, sea3};
  • it contains the same 20 ints
  • however, each of the four elements of sea_m is a pointer—none of the elements of sea were pointers
  • within each of the rows (sea0, sea1, sea2, sea3), the 5 ints are allocated as a contiguous block of memory, but each row could be put anywhere
  • see the difference visually in the spreadsheet linked in the Nested Arrays section
  • the C code for get_sea_m_digit is the same as get_sea_digit for 2D arrays
int get_sea_m_digit (int index, int digit) {
    return sea_m[index][digit];
}
  • but the assembly for accessing an element will be different
get_sea_digit:
        leaq (%rdi,%rdi,4), %rax  # 5 * index
        addq %rax, %rsi           # 5 * index + digit
        movl sea(,%rsi,4), %eax   # *(sea + 4 * (5 * index + digit))
        ret

get_sea_m_digit:
        salq $2, %rsi # rsi = 4*digit
        addq sea_m(,%rdi,8), %rsi # p = sea_m[index] + 4*digit
        movl (%rsi), %eax # return *p
        ret
  • accessing an element now requires two memory accesses
  • the benefit of this multilevel structure is that the rows can be different lengths
  • array access looks the same in C (sea[3][2] and sea_m[3][2]), but the address computation underneath differs
    • Mem[ sea + 20*index + 4*digit ] vs Mem[ Mem[ sea_m + 8*index ] + 4*digit ]

5 Warmup

For each of the following array accesses to the array pictured below, determine if it is a valid access and, if so, what value it returns. [2]

  1. sea[2][5]
  2. sea[4][-1]
  3. sea[0][19]

array-2d.png

Which of the following statements is FALSE? [3]

  1. sea[4][-2] is a valid array reference
  2. sea[1][1] makes two memory accesses
  3. sea[2][1] will always be a higher address than sea[1][2]
  4. sea[2] is calculated using only lea

6 Multi-level array exercise

For each of the following array accesses to the array pictured below, determine if it is a valid access and, if so, what value it returns. [4]

  1. sea[2][3]
  2. sea[1][5]
  3. sea[2][-2]

array-multilevel.png

7 Background for Lab 2

  • register uses: %rax, %rsp, %rdi, %rsi, %rdx, %rcx, %r8, %r9
  • starting lab 2
    • hacking vs analyzing
    • either way, practice the skill of finding the relevant detail and ignoring the rest
  • basic commands: r(un), b(reak), stepi (si), nexti (ni), c(ontinue), layout asm, layout reg, disas, p(rint), x, finish
  • blank line at the end of defuse.txt
  • push/pop/moving %rsp at the beginning and end of functions
  • sscanf

8 gdb activity

  • Get started with these commands:
wget http://cs.carleton.edu/faculty/awb/cs208/s22/notes/gdb-activity.tar
tar xvf gdb-activity.tar
cd gdb-activity
make
  • Try running the program with ./gdb-activity. What happens?
  • Open gdb-activity.c
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

int compare(int a, int b);

int main(int argc, char** argv)
{
    int a, b, n;
    char input[100];
    printf("enter good args: ");
    if (fgets(input, 100, stdin) == NULL) {
        printf("I said good args\n");
    }

    n = sscanf(input, "%d %d", &a, &b);
    if (n == 2 && compare(a,b) == 1) {
        printf("good args!\n");
    }
    else {
        printf("bad args, try harder!\n");
    }
    return 0;
}

Observations:

  • four functions are called: printf, fgets, sscanf, and compare
    • the first three are C library functions (since they aren't declared anywhere in this file, they must come from the #include of library headers)
    • look each library function up in the terminal with man 3 FUNCTION, or consult cplusplus.com/FUNCTION (the latter is often easier to understand)
  • to get the program to print "good args!", we need fgets to return something other than NULL, have sscanf return 2, and have compare(a, b) return 1
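    • fgets will read a line from standard input (stdin) and store it in input, reading at most 99 characters so there is room for the terminating null byte
      • it returns NULL only on an error or when end-of-file is reached before any characters are read, so we probably don't have to worry about that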
    • sscanf is a super useful function: it parses a string (the first parameter) according to a format string given by the second parameter
      • %d is the format specifier for an integer, so this sscanf call will parse input as two integers separated by a space
      • the parsed items (i.e., each %d) will be written to the corresponding pointers provided as arguments after the format string
        • so the first integer in input will be stored in a and the second will be stored in b
    • compare is actually implemented in raw assembly in gdb-activity.s
  • Let's use gdb to get a sense for how everything is fitting together. (This tutorial video goes over using gdb if you want to review.)
  • Running disas compare from within gdb gives
0x00000000004006d7 <+0>:     push   %rbx
0x00000000004006d8 <+1>:     mov    %rdi,%rbx
0x00000000004006db <+4>:     add    $0x5,%rbx
0x00000000004006df <+8>:     add    %rsi,%rbx
0x00000000004006e2 <+11>:    cmp    $0xd0,%rbx
0x00000000004006e9 <+18>:    sete   %al
0x00000000004006ec <+21>:    movzbq %al,%rax
0x00000000004006f0 <+25>:    pop    %rbx
0x00000000004006f1 <+26>:    retq
  • We can reverse engineer the C code to be something like
int compare(int a, int b) {
    return a + b + 5 == 0xd0; // we add %rdi, %rsi, and 5 together in %rbx and then compare it to $0xd0
                              // sete writes 1 to the given register if the cmp indicates the operands are equal
                              // since this is the return value, and we want compare to return 1, 
                              // we should choose inputs to make these equal
}

9 Activity

What effect does this assembly program have on registers and memory given the initial values below? [5]

f:
        movl    $1, (%rdi)
        movl    $1, 4(%rdi)
        movl    $2, %edx
        jmp     .L2
.L3:
        movslq  %edx, %rax
        salq    $2, %rax
        movl    -8(%rdi,%rax), %ecx
        addl    -4(%rdi,%rax), %ecx
        movl    %ecx, (%rdi,%rax)
        addl    $1, %edx
.L2:
        cmpl    %esi, %edx
        jl      .L3
        rep ret
main:
        subq    $32, %rsp
        movl    $7, %esi
        movq    %rsp, %rdi
        call    f
        movl    $0, %eax
        addq    $32, %rsp
        ret

lb12-activity-diagram-start.png

10 Exercise

Given the C code and the register-to-variable mapping below, see how far you can get filling in the corresponding assembly: [6]

for (long i = 0; i < size; i++) {
    total += arr[i];
}
Register   Use
%rdi       arr
%rsi       size
%rdx       i
%rax       total
init:
        ________________
        ________________
body:
        ________________
        ________________
test:
        ________________
        ________________

11 Practice

CS:APP practice problems 3.36 (p. 256) and 3.37 (p. 258)

Footnotes:

1

N * sizeof(T)

2
  1. valid, 9 (start of row 2, then 5 ints forward)
  2. valid, 5 (start of the non-existent row 4, then 1 int backwards, thereby referencing the valid element at the end of row 3)
  3. valid, 5 (start of row 0, then 19 ints forward, referencing the final int in the array)
3

2. is the false statement. The assembly for sea[1][1] will compute the address of that specific element and then make a single memory access to that address.

4
  1. valid, 0
  2. invalid, past the end of row 1 (and rows are not adjacent in memory in this multilevel array)
  3. invalid, before the start of row 2 (and rows are not adjacent in memory in this multilevel array)
5

Here's a video walkthrough. The Panopto video is below (view it in Panopto here), along with the initial memory/register diagram, the C code and assembly side-by-side in godbolt, and the final memory/register diagram.






6
init:
        movl $0, %edx
        jmp  test
body:
        addl (%rdi, %rdx, 4), %eax
        addq $1, %rdx
test:
        cmpq %rsi, %rdx
        jl   body