CS 208 s20 — C: Data Representation

Table of Contents

1 Reading: Descent Into C

This article provides an excellent overview of the strange world of C programming. After reading it you should have a sense of:

  • how arrays are laid out in memory (contiguously)
  • that pointers are treated the same as any other values; they just happen to hold memory addresses
    • except arithmetic with pointers is special
  • C strings are just char arrays
  • what #include is for

2 C Arrays (and Strings)

Like many programming languages, C provides fixed-length arrays. These arrays occupy a contiguous chunk of memory with the elements in index order.

int a[6]; // declaration, stored at 0x10

// indexing:
a[0] = 0x015f; 
a[5] = a[0];

// No bounds checking
a[6] = 0xbad; // writes to 0x28
a[-1] = 0xbad; // writes to 0xc

// an array name used as a value acts as a pointer to its first element
int* p; // stored at 0x40

// these two lines are equivalent
p = a;
p = &a[0];

// write to a[0]
*p = 0xa;

// see subsection on strings
char str[] = "hi!!!!!";

This spreadsheet memory diagram shows one way the above code could affect memory. Note that this is now a 64-bit example, so memory addresses (and thus the pointer p) are 8 bytes wide. The row addresses and offsets work the same way as in the previous examples, except that each row is 8 bytes instead of 4. This means the block of 24 bytes holding the 6 ints in the array a takes up 3 adjacent rows. Trace through the code and study the memory diagram, and ask any questions you have in Slack!

I've put this into C Tutor here, if you'd like to see it in that context (it doesn't show the writes outside the array bounds).

2.1 C Strings

As previously mentioned, a C string is simply an array of char (array of 1-byte values) ending with a null terminator (0x00 or '\0'). The terminator value is crucial because otherwise C has no way of knowing where the string ends. If a string is missing the null terminator, C will keep reading bytes from memory until it happens to encounter 0x00.

In memory, characters are represented using their ASCII (American Standard Code for Information Interchange) values. ASCII maps each letter or symbol to a particular numeric value, usually written in hex (see the table below or run man ascii in your terminal).

ascii.png

Thus, the string "hi!!!!!" is represented by the array of bytes

0x68	0x69	0x21	0x21	0x21	0x21	0x21	0x00

2.1.1 The C string library

  • #include <string.h> gives you access to some useful functions for working with C strings
  • http://cplusplus.com/reference is an excellent source of documentation
  • strlen(s) returns the length of the string (char *) s, not counting the null terminator
  • strcpy(dest, src) copies the string at src to dest (both arguments are of type char *, i.e., pointers to the start of a char array)
    • strcpy provides no protection if src is longer than dest; it will simply overwrite whatever is in memory after the end of the dest array
    • fortunately, strncpy lets you provide a maximum number of characters to copy (note that if src does not fit within that limit, strncpy does not add a null terminator to dest)

2.1.2 QUICK CHECK: how would you declare an array of three strings (i.e., what is the type signature)? 1

3 C Memory Management

3.1 Allocating Memory

There are two ways that memory gets allocated for data storage:

  1. Compile Time (or static) Allocation
    • Memory for named variables is allocated by the compiler
    • Exact size and type of storage must be known at compile time
    • For standard array declarations, this is why the size has to be constant
  2. Dynamic Memory Allocation
    • Memory allocated "on the fly" during run time
    • Dynamically allocated space usually placed in a program segment known as the heap or the free store
    • Exact amount of space or number of items does not have to be known by the compiler in advance.
    • For dynamic memory allocation, pointers are crucial

3.2 Dynamic Memory Allocation

  • We can dynamically allocate storage space while the program is running, but we cannot create new variable names "on the fly"
  • For this reason, dynamic allocation requires two steps:
    1. Creating the dynamic space.
    2. Storing its address in a pointer (so that the space can be accessed)
  • To dynamically allocate memory in C, we use the malloc function provided by stdlib.h
    • malloc takes in a number of bytes to allocate and returns a pointer to the newly allocated space
    • malloc does not do any initialization, so the contents of this memory can be whatever was last written to those bytes
      • this means you must always initialize memory returned by malloc before reading from it
  • De-allocation:
    • Deallocation is the "clean-up" of space being used for variables or other data storage
    • Compile time variables are automatically deallocated based on their known extent
    • It is the programmer's job to deallocate dynamically created space
    • To de-allocate dynamic memory, we use the free function
      • free takes a pointer that was previously returned by malloc and makes that memory available for future allocation
      • free has undefined behavior if used with a pointer that wasn't returned by malloc or a pointer that was already freed

3.3 Memory Layout

Though memory is one long array of bytes, it is divided into several segments or memory regions. Memory for local variables (those declared inside a function without using malloc) is located on the stack, which resides at high addresses in memory. As more data is added to the stack, the region grows to include lower addresses, so we say the stack grows down.

Memory allocated dynamically using malloc is placed on the heap, which resides somewhere in the middle of memory. As more data is added to the heap, the region grows to include higher addresses, so we say the heap grows up.

The other three regions (static data, literals, and instructions) are fixed in size and get initialized when a program starts running.

mem-layoutv2.png

This C Tutor example demonstrates how strings are allocated in various regions of memory. Note that C Tutor labels the heap-allocated string and the string literal as both being in the "Heap", but pay attention to the pointer values that get printed out. The stack-allocated string is located at a high address, the heap-allocated string at a middle address, and the static literal at a low address. This code also gives an example of using the C standard library to copy a string.

4 C Structures

While there are no objects in C, the programmer can still create new data types. The two ways to create data types in C are structures (struct) and unions (union) (we won't worry about unions in this course). A struct is a way of combining different types of data:

// a way of combining different types of data together
struct song {
    char *title;
    int length_in_seconds;
    int year_released;
};
struct song song1;
song1.title = "What is Urinetown?";  // probably my favorite musical
song1.length_in_seconds = 213;
song1.year_released = 2001;
  • variable declarations like any other type: struct name name1, *pn, name_ar[3];
  • common to use typedef to give the struct type a more concise alias (see footnote 2)
    • typedef struct song song_t;
  • access fields with . or -> in the case of the pointer (p->field is shorthand for (*p).field)
  • like arrays, struct elements are stored in a contiguous region with a pointer to the first byte
    • sizeof(struct song)? 16 bytes
    • compiler maintains the byte offset information needed to access additional fields
    • can find offset of individual fields using offsetof(type, member)

4.1 Passing a struct

This C Tutor example shows the importance of using a pointer to pass a struct to a function rather than passing the struct itself. I've created a struct foo that contains a long and an array of 11 char. There's a function get that takes a struct foo and an index, and returns the char at that index. There's another function set that takes a struct foo, an index, and a char and sets the char at that index.

There are two versions of each of these functions, one that takes a struct foo and one that takes a struct foo*. Notice how with get, passing a struct foo results in the entire structure being copied. With set, it doesn't even work when passing a struct foo because the modification is done to the local copy.

5 Bringing It All Together

Here's an extended example playing around with a struct and heap vs stack allocation. Plug it into C Tutor or compile and run it yourself.

/* A demonstration of pointers and pass-by-value semantics in C
 * CS 208 Winter 2020
 * Aaron Bauer, Carleton College
 * compile with "gcc -Og -g -o point_test point_test.c"
 * to get consistent address values, run with "setarch x86_64 -R ./point_test"
 */

#include <stdio.h>
#include <stdlib.h>

typedef struct point {
  int x;
  int y;
} point_t;

void f(point_t p) {
  p.x++;
  printf("\nf: &p = %p\n", &p); // different address than &p in main, p has been copied
  printf("f: p = (%d, %d)\n\n", p.x, p.y);
}

void g(point_t *q) {
  q->x++;
  printf("\ng: &q = %p\n", &q); // different address than &q in main, q has been copied
  printf("g: q = %p\n", q); // but the value of q is the same (same heap address), the structure hasn't been copied
  printf("g: q = (%d, %d)\n\n", q->x, q->y);
}

int main() {
  // code stored at low addresses
  printf("&main = %p\n", &main);
  printf("&f = %p\n", &f);
  printf("&g = %p\n\n", &g);

  // p is stack-allocated (does not use malloc)
  point_t p = {5, 6};
  printf("&p = %p\n", &p);
  printf("p = (%d, %d)\n", p.x, p.y);
  f(p);
  printf("p = (%d, %d)\n\n", p.x, p.y); // mutation in f does not affect p, as p was copied

  point_t *q = (point_t*)malloc(sizeof(point_t));
  q->x = 3; // q->x is shorthand for (*q).x
  q->y = 4;
  printf("&q = %p\n", &q); // q is stack-allocated, &q is high in memory
  printf("q = %p\n", q); // q points to heap data, so its value is a much lower address (higher than code)
  printf("q = (%d, %d)\n", q->x, q->y);
  g(q);
  printf("q = (%d, %d)\n", q->x, q->y); // mutation in g affects *q, same heap address in main and g
  free(q);
  printf("q = (%d, %d)\n", q->x, q->y); // undefined behavior to access freed memory
}

6 Homework

  1. Read through lab 0 and post any questions you have in Slack. Lab 0 is due Monday, April 20 at 9pm Central time.
  2. Take the Week 1 quiz on Moodle. You must complete it by 9pm Wednesday, April 15.

Footnotes:

1

An array of three strings would be declared as char *str_array[3]. This would only allocate space for the three pointers, however, not for the strings themselves. You could use malloc to allocate space on the heap for the actual char arrays.

2

A typedef statement introduces a shorthand name for a type. The syntax is…

typedef <type> <name>;

The following defines the type Fraction to be the type struct fraction. C is case sensitive, so fraction is different from Fraction. It's convenient to use typedef to create types with upper-case names and use the lower-case version of the same word as a variable.

typedef struct fraction Fraction;
Fraction fraction;      // Declare the variable "fraction" of type "Fraction"
                        // which is really just a synonym for "struct fraction".

The following typedef defines the name Tree as a standard pointer to a binary tree node where each node contains some data and "smaller" and "larger" subtree pointers.

typedef struct treenode* Tree;
struct treenode {
    int data;
    Tree smaller, larger;        // equivalently, this line could say
};                               // "struct treenode *smaller, *larger"