CS 208 s20 — C: Data Representation
1 Reading: Descent Into C
This article provides an excellent overview of the strange world of C programming. After reading it you should have a sense of:
- how arrays are laid out in memory (contiguously)
- that pointers are treated the same as any other values, they just happen to be memory addresses
  - except arithmetic with pointers is special
- C strings are just char arrays
- what #include is for
2 C Arrays (and Strings)
Like many programming languages, C provides fixed-length arrays. These arrays occupy a contiguous chunk of memory with the elements in index order.
    int a[6];        // declaration, stored at 0x10

    // indexing:
    a[0] = 0x015f;
    a[5] = a[0];

    // No bounds checking
    a[6] = 0xbad;    // writes to 0x28
    a[-1] = 0xbad;   // writes to 0xc

    // an "array" is just a pointer to an element
    int* p;          // stored at 0x40
    // these two lines are equivalent
    p = a;
    p = &a[0];

    // write to a[0]
    *p = 0xa;

    // see subsection on strings
    char str[] = "hi!!!!!";
This spreadsheet memory diagram shows one way the above code could affect memory.
Note that this is now a 64-bit example, so memory addresses (and thus the pointer p) are 8 bytes wide.
The row addresses and offsets work the same way as in the previous examples; the rows are just 8 bytes instead of 4.
This means the block of 24 bytes holding the 6 ints in the array a takes up 3 adjacent rows.
Trace through the code and study the memory diagram—ask any questions you have in Slack!
I've put this into C Tutor here, if you'd like to see it in that context (it doesn't show the writes outside the array bounds).
2.1 C Strings
As previously mentioned, a C string is simply an array of char (array of 1-byte values) ending with a null terminator (0x00 or '\0').
The terminator value is crucial because otherwise C has no way of knowing where the string ends.
If a string is missing the null terminator, C will keep reading bytes from memory until it happens to encounter 0x00.
In memory, characters are represented using their ASCII values (American Standard Code for Information Interchange), which map each letter or symbol to a particular hex value (see the table below or run man ascii in your terminal).
Thus, the string "hi!!!!!" is represented by the array of bytes
0x68 0x69 0x21 0x21 0x21 0x21 0x21 0x00
2.1.1 The C string library
#include <string.h> gives you access to some useful functions for working with C strings
- http://cplusplus.com/reference is an excellent source of documentation
- strlen(s) returns the length of the string (char *) s, not counting the null terminator
- strcpy(dest, src) copies the string at src to dest (both arguments are of type char *, pointers to the start of a char array)
  - strcpy provides no protection if src is longer than dest; it will simply overwrite whatever is in memory after the end of the dest array
  - fortunately, strncpy lets you provide a maximum number of characters to copy
2.1.2 QUICK CHECK: how would you declare an array of three strings (i.e., what is the type signature)? [1]
3 C Memory Management
3.1 Allocating Memory
There are two ways that memory gets allocated for data storage:
- Compile Time (or static) Allocation
  - Memory for named variables is allocated by the compiler
  - Exact size and type of storage must be known at compile time
  - For standard array declarations, this is why the size has to be constant
- Dynamic Memory Allocation
  - Memory allocated "on the fly" during run time
  - Dynamically allocated space is usually placed in a program segment known as the heap or the free store
  - Exact amount of space or number of items does not have to be known by the compiler in advance
  - For dynamic memory allocation, pointers are crucial
3.2 Dynamic Memory Allocation
- We can dynamically allocate storage space while the program is running, but we cannot create new variable names "on the fly"
- For this reason, dynamic allocation requires two steps:
  - Creating the dynamic space.
  - Storing its address in a pointer (so that the space can be accessed)
- To dynamically allocate memory in C, we use the malloc function provided by stdlib.h
  - malloc takes in a number of bytes to allocate and returns a pointer to the newly allocated space
  - malloc does not do any initialization, so the contents of this memory can be whatever was last written to those bytes
    - this means you must always perform initialization on memory returned by malloc
- De-allocation:
  - Deallocation is the "clean-up" of space being used for variables or other data storage
  - Compile time variables are automatically deallocated based on their known extent
  - It is the programmer's job to deallocate dynamically created space
  - To de-allocate dynamic memory, we use the free function
    - free takes a pointer that was previously returned by malloc and makes that memory available for future allocation
    - free has undefined behavior if used with a pointer that wasn't returned by malloc or a pointer that was already freed
3.3 Memory Layout
Though memory is a long array of bytes, it is actually separated into various segments or memory regions. Memory allocated at compile time is located on the stack, which resides at high addresses in memory. As more data is added to the stack, the region grows to include lower addresses, so we say the stack grows down.
Memory allocated dynamically using malloc is placed on the heap, which resides somewhere in the middle of memory.
As more data is added to the heap, the region grows to include higher addresses, so we say the heap grows up.
The other three regions (static data, literals, and instructions) are fixed in size and get initialized when a program starts running.
This C Tutor example demonstrates how strings are allocated in various regions of memory. Note that C Tutor labels the heap-allocated string and the string literal as both being in the "Heap", but pay attention to the pointer values that get printed out. The stack-allocated string is located at a high address, the heap-allocated string at a middle address, and the static literal at a low address. This code also gives an example of using the C standard library to copy a string.
4 C Structures
While there are no objects in C, the programmer can still create new data types.
The two ways to create data types in C are structures (struct) and unions (union) (we won't worry about unions in this course).
A struct is a way of combining different types of data:
    // a way of combining different types of data together
    struct song {
        char *title;
        int length_in_seconds;
        int year_released;
    };

    struct song song1;
    song1.title = "What is Urinetown?";  // probably my favorite musical
    song1.length_in_seconds = 213;
    song1.year_released = 2001;
- variable declarations like any other type: struct name name1, *pn, name_ar[3];
- common to use typedef to give the struct type a more concise alias [2]: typedef struct song song_t;
- access fields with . or, in the case of a pointer, -> (p->field is shorthand for (*p).field)
- like arrays, struct elements are stored in a contiguous region with a pointer to the first byte
  - sizeof(struct song)? 16 bytes on a typical 64-bit system (an 8-byte pointer plus two 4-byte ints)
  - compiler maintains the byte offset information needed to access additional fields
  - can find offset of individual fields using offsetof(type, member) from stddef.h
4.1 Passing a struct
This C Tutor example shows the importance of using a pointer to pass a struct to a function rather than passing the struct itself.
I've created a struct foo that contains a long and an array of 11 char.
There's a function get that takes a struct foo and an index, and returns the char at that index.
There's another function set that takes a struct foo, an index, and a char and sets the char at that index.
There are two versions of each of these functions, one that takes a struct foo and one that takes a struct foo*.
Notice how with get, passing a struct foo results in the entire structure being copied.
With set, it doesn't even work when passing a struct foo because the modification is done to the local copy.
5 Bringing It All Together
Here's an extended example playing around with a struct and heap vs stack allocation.
Plug it into C Tutor or compile and run it yourself.
    /* A demonstration of pointers and pass-by-value semantics in C
     * CS 208 Winter 2020
     * Aaron Bauer, Carleton College
     * compile with "gcc -Og -g -o point_test point_test.c"
     * to get consistent address values, run with "setarch x86_64 -R ./point_test"
     */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct point {
        int x;
        int y;
    } point_t;

    void f(point_t p) {
        p.x++;
        printf("\nf: &p = %p\n", &p); // different address than &p in main, p has been copied
        printf("f: p = (%d, %d)\n\n", p.x, p.y);
    }

    void g(point_t *q) {
        q->x++;
        printf("\ng: &q = %p\n", &q); // different address than &q in main, q has been copied
        printf("g: q = %p\n", q);     // but the value of q is the same (same heap address), the structure hasn't been copied
        printf("g: q = (%d, %d)\n\n", q->x, q->y);
    }

    int main() {
        // code stored at low addresses
        printf("&main = %p\n", &main);
        printf("&f = %p\n", &f);
        printf("&g = %p\n\n", &g);

        // p is stack-allocated (does not use malloc)
        point_t p = {5, 6};
        printf("&p = %p\n", &p);
        printf("p = (%d, %d)\n", p.x, p.y);
        f(p);
        printf("p = (%d, %d)\n\n", p.x, p.y); // mutation in f does not affect p, as p was copied

        point_t *q = (point_t*)malloc(sizeof(point_t));
        q->x = 3; // q->x is shorthand for (*q).x
        q->y = 4;
        printf("&q = %p\n", &q); // q is stack-allocated, &q is high in memory
        printf("q = %p\n", q);   // q points to heap data, so its value is a much lower address (higher than code)
        printf("q = (%d, %d)\n", q->x, q->y);
        g(q);
        printf("q = (%d, %d)\n", q->x, q->y); // mutation in g affects *q, same heap address in main and g

        free(q);
        printf("q = (%d, %d)\n", q->x, q->y); // undefined behavior to access freed memory
    }
6 Homework
- Read through lab 0 and post any questions you have in Slack. Lab 0 is due Monday, April 20 at 9pm Central time.
- Take the Week 1 quiz on Moodle. You must complete it by 9pm Wednesday, April 15.
Footnotes:
[1] An array of three strings would be declared as char *str_array[3]; (an array of three char * pointers). This would only allocate space for the three pointers, however, not for the strings themselves. You could use malloc to allocate space on the heap for the actual char arrays.
[2] A typedef statement introduces a shorthand name for a type. The syntax is…
    typedef <type> <name>;
The following defines the Fraction type to be the type (struct fraction).
C is case sensitive, so fraction is different from Fraction.
It's convenient to use typedef to create types with upper case names and use the lower-case version of the same word as a variable.
    typedef struct fraction Fraction;

    Fraction fraction;   // Declare the variable "fraction" of type "Fraction"
                         // which is really just a synonym for "struct fraction".
The following typedef defines the name Tree as a standard pointer to a binary tree node where each node contains some data and "smaller" and "larger" subtree pointers.
    typedef struct treenode* Tree;
    struct treenode {
        int data;
        Tree smaller, larger;   // equivalently, this line could say
    };                          // "struct treenode *smaller, *larger"