Lab: Database Interfaces and SQL Injection

Table of Contents

1 Administrivia

  • I have restructured the second half of the course somewhat:
    • Major change: project 2 will go out after midterm break
    • Lab topics and other deadlines have been adjusted accordingly
  • Midterm will be posted on Monday, due 9pm Friday, April 30
    • No time limit
    • Open note, open book, open anything linked from the course web page, open computer (writing code/queries, doing math)
    • Closed internet (other than course web page links), closed any other person besides me
    • A mix of short answer and multiple choice questions covering material from the first four weeks
    • Will be on Gradescope, like HW2
      • Gradescope has a strange interface: saving your answers "submits" the work you've done so far. You continue working by clicking on "Resubmit". You can submit/resubmit any number of times.

2 Command-Line Interface

  • This is all we've worked with up to this point
  • SQLite is a "serverless" database system, meaning I just need the sqlite3 executable and a database file
    • No need for the database system to already be running and configured, we're not connecting to anything
    $ sqlite3 musicbrains-cs344-s21.db
    SQLite version 3.28.0 2019-04-15 14:49:49
    Enter ".help" for usage hints.
    sqlite>
    
  • .help shows us all sorts of useful things:
    • .tables and .indexes to list tables and indexes
    • .dump to generate SQL to create tables and insert data
    • .timer on to display timing information after each query
    • .headers on to display column headers on output
    • .excel to direct output to a CSV file and open the file upon query completion
    • .output FILE to send output to a file
    • .expert is a neat experimental feature that will suggest useful indexes for a query
      sqlite> .expert
      sqlite> select artist.name, count(distinct artist_alias.name) as count
         ...> from artist, artist_alias, area
         ...> where artist.id = artist_alias.artist and artist.area = area.id and area.name = 'United Kingdom' and artist.begin_date_year > 1950
         ...> group by artist.id
         ...> order by count desc
         ...> limit 10;
      CREATE INDEX artist_alias_idx_00643fc7 ON artist_alias(artist);
      CREATE INDEX artist_idx_f8d047e7 ON artist(area, begin_date_year);
      CREATE INDEX area_idx_22db824a ON area(name, id);
      
      SEARCH TABLE area USING COVERING INDEX area_idx_22db824a (name=?)
      SEARCH TABLE artist USING INDEX artist_idx_f8d047e7 (area=? AND begin_date_year>?)
      SEARCH TABLE artist_alias USING INDEX artist_alias_idx_00643fc7 (artist=?)
      USE TEMP B-TREE FOR GROUP BY
      USE TEMP B-TREE FOR ORDER BY
      
      • A covering index is an index that contains all the attributes involved in a query. This can improve performance because a query can retrieve all the information it needs from the index and never have to go to the original data file
    • Let's see how much the expert advice speeds up this query:
      sqlite> .timer on
      sqlite> select artist.name, count(distinct artist_alias.name) as count
         ...> from artist, artist_alias, area
         ...> where artist.id = artist_alias.artist and artist.area = area.id and area.name = 'United Kingdom' and artist.begin_date_year > 1950
         ...> group by artist.id
         ...> order by count desc
         ...> limit 10;
      name|count
      The Beatles|35
      Judas Priest|16
      Deep Purple|14
      Orchestral Manoeuvres in the Dark|14
      The KLF|14
      Led Zeppelin|13
      T. Rex|13
      National Philharmonic Orchestra|13
      Chris Barber’s Jazz Band|12
      Heller & Farley|12
      Run Time: real 2.239 user 2.154869 sys 0.074757
      sqlite> CREATE INDEX artist_alias_idx_00643fc7 ON artist_alias(artist);
      Run Time: real 0.184 user 0.144469 sys 0.037718
      sqlite> CREATE INDEX artist_idx_f8d047e7 ON artist(area, begin_date_year);
      Run Time: real 1.424 user 1.169061 sys 0.229029
      sqlite> CREATE INDEX area_idx_22db824a ON area(name, id);
      Run Time: real 0.082 user 0.066150 sys 0.013522
      sqlite> select artist.name, count(distinct artist_alias.name) as count
         ...> from artist, artist_alias, area
         ...> where artist.id = artist_alias.artist and artist.area = area.id and area.name = 'United Kingdom' and artist.begin_date_year > 1950
         ...> group by artist.id
         ...> order by count desc
         ...> limit 10;
      name|count
      The Beatles|35
      Judas Priest|16
      Deep Purple|14
      Orchestral Manoeuvres in the Dark|14
      The KLF|14
      Led Zeppelin|13
      T. Rex|13
      National Philharmonic Orchestra|13
      Chris Barber’s Jazz Band|12
      Heller & Farley|12
      Run Time: real 0.094 user 0.056785 sys 0.035941
      
    • Ok, but what did this cost us in terms of storage space?
      • SQLite provides a dbstat table that contains information about database internals
        sqlite> select sum(pgsize)/(1024*1024) from dbstat where name like '%idx%';
        23
        
      • The pages for the indexes are taking up about 23 MB
      • To verify, we check the size of the .db file
        -rw-r--r--    1 awb  awb   548M Apr 23 12:06 musicbrainz-cs334-s21.db
        
      • Delete the indexes
        sqlite> drop index area_idx_22db824a;
        sqlite> drop index artist_idx_f8d047e7;
        sqlite> drop index artist_alias_idx_00643fc7;
        
      • Check file size again
        -rw-r--r--    1 awb  awb   548M Apr 23 12:17 musicbrainz-cs334-s21.db
        
      • Hmm, unchanged. What does dbstat say?
        sqlite> select sum(pgsize)/(1024*1024) from dbstat where name like '%idx%';
        
        sqlite> select sum(pgsize)/(1024*1024) from dbstat;
        524
        
      • dbstat indicates no pages are being used for the indexes and that the total size is 524 MB. What's going on here is that SQLite doesn't automatically compact the database file. Right now the pages that were used for the indexes are still sitting around. We can use the VACUUM statement to tell SQLite to compact the database file into the minimum size.
        sqlite> VACUUM;
        
        -rw-r--r--    1 awb  awb   525M Apr 23 12:22 musicbrainz-cs334-s21.db
        

3 GUI Interface: pgAdmin

  • Database systems typically provide a GUI interface as well
  • For Postgres, this is the pgAdmin application
  • Since Postgres is a client-server database, it needs to already be running, and we use pgAdmin to connect to it
  • Can use this for all kinds of administrative tasks, monitoring, writing queries, and analyzing performance
    • Even manually editing data!
    • For our last lab of the term we'll take a closer look at the configuration and administrative side
  • If you'd like to play around with pgAdmin, they have an online demo version: https://www.pgadmin.org/try/

4 Programming Interface: Python + sqlite3

  • Of course it's very common to need to write software that interacts with a database
  • There are standard APIs that mainstream database systems support
    • Open Database Connectivity (ODBC) for C
    • Java Database Connectivity (JDBC) for Java, based on ODBC
    • These are just interfaces, and a library would implement the API for a specific database system
    • Python has a built-in sqlite3 module (https://docs.python.org/3/library/sqlite3.html), and modules exists for Postgres, MySQL, etc.
  • So let's take Python + SQLite for a spin
    import sqlite3
    
    con = sqlite3.connect("musicbrainz-cs334-s21.db")
    q = con.execute("select * from medium_format")
    print(q.fetchall())
    
  • Just running a query is something we could do without Python, so let's make it more interesting by doing something that's much easier in Python than SQLite
  • For example, SQLite doesn't have a funtion to reverse a string, so if we wanted to reverse all the format names, we could do
    import sqlite3
    
    con = sqlite3.connect("musicbrainz-cs334-s21.db")
    q = con.execute("SELECT * FROM medium_format")
    
    formats = q.fetchall()
    
    for i, name, description in formats:
        print(name)
        con.execute(f"UPDATE medium_format SET name='{name[::-1]}' WHERE id={i}")
    
    con.commit()
    con.close()
    
  • We could even get the user in on the fun
    import sqlite3
    
    con = sqlite3.connect("musicbrainz-cs334-s21.db")
    
    artist = input("What artist would you like to search for? ")
    q = con.execute(f"""SELECT release.name 
                        FROM release, artist_credit 
                        WHERE release.artist_credit = artist_credit.id 
                          AND artist_credit.name='{artist}'""")
    print(f"{artist} has these releases in the database:")
    for name in q:
        print(name)
    
    con.commit()
    con.close()
    
  • The way I have handled user input is a BAD IDEA™
    $ python3 sqlite-fun.py                                                             
    What artist would you like to search for? ' UNION SELECT name from release_status; --
    ' UNION SELECT name from release_status; -- has these releases in the database:
    ('Bootleg',)
    ('Official',)
    ('Promotion',)
    ('Pseudo-Release',)
    
  • This is called a SQL injection attack (https://xkcd.com/327/)
    • The user was able to gain access to unrelated information in the database
    • Imagine if there was private customer data or credit card information or passwords stored in another table…
  • We can rewrite the Python to protect against this
    import sqlite3
    
    con = sqlite3.connect("musicbrainz-cs334-s21.db")
    con.set_trace_callback(print)
    artist = input("What artist would you like to search for? ")
    q = con.execute(f"""SELECT release.name 
                        FROM release, artist_credit 
                        WHERE release.artist_credit = artist_credit.id 
                          AND artist_credit.name=?""", (artist,))
    print(f"{artist} has these releases in the database:")
    for name in q:
        print(name)
    
    con.commit()
    con.close()
    

5 SQL Injection

A great site about SQL Injection with examples and explanations: http://databases.cs.virginia.edu/sqlinject/