{ "metadata": { }, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
\n\n# Advanced SQL\n\nby [The Carpentries](https://training.galaxyproject.org/hall-of-fame/carpentries/), [Helena Rasche](https://training.galaxyproject.org/hall-of-fame/hexylena/), [Avans Hogeschool](https://training.galaxyproject.org/hall-of-fame/avans-atgm/)\n\nCC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)\n\n**Questions**\n\n- How can I calculate sums, averages, and other summary values?\n- How can I combine data from multiple tables?\n- How should I format data in a database, and why?\n- How can I create, modify, and delete tables and data?\n- How can I access databases from programs written in Python?\n\n**Objectives**\n\n- Define aggregation and give examples of its use.\n- Write queries that compute aggregated values.\n- Trace the execution of a query that performs aggregation.\n- Explain how missing data is handled during aggregation.\n- Explain the operation of a query that joins two tables.\n- Explain how to restrict the output of a query containing a join to only include meaningful combinations of values.\n- Write queries that join tables on equal keys.\n- Explain what primary and foreign keys are, and why they are useful.\n- Explain what an atomic value is.\n- Distinguish between atomic and non-atomic values.\n- Explain why every value in a database should be atomic.\n- Explain what a primary key is and why every record should have one.\n- Identify primary keys in database tables.\n- Explain why database entries should not contain redundant information.\n- Identify redundant information in databases.\n- Write statements that create tables.\n- Write statements to insert, modify, and delete records.\n- Write short programs that execute SQL queries.\n- Trace the execution of a program that contains an SQL query.\n- Explain why most database applications are written in a general-purpose language rather than in SQL.\n\n**Time Estimation: 3H**\n
\n", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-0", "source": "
\n
💬 Comment
\n

This tutorial is significantly based on the Carpentries Databases and SQL lesson, which is licensed CC-BY 4.0.

\n

Abigail Cabunoc and Sheldon McKay (eds): “Software Carpentry: Using Databases and SQL.” Version 2017.08, August 2017,\ngithub.com/swcarpentry/sql-novice-survey, https://doi.org/10.5281/zenodo.838776

\n

Adaptations have been made to make this work better in a GTN/Galaxy environment.

\n
\n
\n
Agenda
\n

In this tutorial, we will cover:

1. Aggregation
2. Combining Data
3. Data Hygiene
4. Creating and Modifying Data
\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-1", "source": [ "# This preamble sets up the sql \"magic\" for jupyter. Use %%sql in your cells to write sql!\n", "!python3 -m pip install ipython-sql sqlalchemy\n", "!wget -c http://swcarpentry.github.io/sql-novice-survey/files/survey.db\n", "import sqlalchemy\n", "engine = sqlalchemy.create_engine(\"sqlite:///survey.db\")\n", "%load_ext sql\n", "%sql sqlite:///survey.db\n", "%config SqlMagic.displaycon=False" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-2", "source": "

# Aggregation

\n

We now want to calculate ranges and averages for our data.\nWe know how to select all of the dates from the Visited table:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-3", "source": [ "%%sql\n", "SELECT dated FROM Visited;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-4", "source": "

but to combine them,\nwe must use an aggregation function\nsuch as min or max.\nEach of these functions takes a set of records as input,\nand produces a single record as output:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-5", "source": [ "%%sql\n", "SELECT min(dated) FROM Visited;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-6", "source": "

\"SQL

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-7", "source": [ "%%sql\n", "SELECT max(dated) FROM Visited;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-8", "source": "

min and max are just two of\nthe aggregation functions built into SQL.\nThree others are avg,\ncount,\nand sum:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-9", "source": [ "%%sql\n", "SELECT avg(reading) FROM Survey WHERE quant = 'sal';" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-10", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-11", "source": [ "%%sql\n", "SELECT count(reading) FROM Survey WHERE quant = 'sal';" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-12", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-13", "source": [ "%%sql\n", "SELECT sum(reading) FROM Survey WHERE quant = 'sal';" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-14", "source": "

We used count(reading) here,\nbut we could just as easily have counted quant\nor any other field in the table,\nor even used count(*),\nsince the function doesn’t care about the values themselves,\njust how many values there are.

\n
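If you want to see the difference for yourself, paste a query like the one below into a %%sql cell; the three counts agree whenever none of the counted fields are null in the selected rows.
\n
SELECT count(*), count(quant), count(reading) FROM Survey WHERE quant = 'sal';\n
\n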

SQL lets us do several aggregations at once.\nWe can,\nfor example,\nfind the range of sensible salinity measurements:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-15", "source": [ "%%sql\n", "SELECT min(reading), max(reading) FROM Survey WHERE quant = 'sal' AND reading <= 1.0;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-16", "source": "

We can also combine aggregated results with raw results,\nalthough the output might surprise you:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-17", "source": [ "%%sql\n", "SELECT person, count(*) FROM Survey WHERE quant = 'sal' AND reading <= 1.0;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-18", "source": "

Why does Lake’s name appear rather than Roerich’s or Dyer’s?\nThe answer is that when it has to aggregate a field,\nbut isn’t told how to,\nthe database manager chooses an actual value from the input set.\nIt might use the first one processed,\nthe last one,\nor something else entirely.

\n

Another important fact is that when there are no values to aggregate —\nfor example, where there are no rows satisfying the WHERE clause —\naggregation’s result is “don’t know”\nrather than zero or some other arbitrary value:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-19", "source": [ "%%sql\n", "SELECT person, max(reading), sum(reading) FROM Survey WHERE quant = 'missing';" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-20", "source": "

One final important feature of aggregation functions is that\nthey are inconsistent with the rest of SQL in a very useful way.\nIf we add two values,\nand one of them is null,\nthe result is null.\nBy extension,\nif we use sum to add all the values in a set,\nand any of those values are null,\nthe result should also be null.\nIt’s much more useful,\nthough,\nfor aggregation functions to ignore null values\nand only combine those that are non-null.\nThis behavior lets us write our queries as:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-21", "source": [ "%%sql\n", "SELECT min(dated) FROM Visited;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-22", "source": "

instead of always having to filter explicitly:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-23", "source": [ "%%sql\n", "SELECT min(dated) FROM Visited WHERE dated IS NOT NULL;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-24", "source": "

Aggregating all records at once doesn’t always make sense.\nFor example,\nsuppose we suspect that there is a systematic bias in our data,\nand that some scientists’ radiation readings are higher than others.\nWe know that this doesn’t work:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-25", "source": [ "%%sql\n", "SELECT person, count(reading), round(avg(reading), 2)\n", "FROM Survey\n", "WHERE quant = 'rad';" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-26", "source": "

because the database manager selects a single arbitrary scientist’s name\nrather than aggregating separately for each scientist.\nSince there are only five scientists,\nwe could write five queries of the form:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-27", "source": [ "%%sql\n", "SELECT person, count(reading), round(avg(reading), 2)\n", "FROM Survey\n", "WHERE quant = 'rad'\n", "AND person = 'dyer';" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-28", "source": "

but this would be tedious, and if we ever had a data set with fifty or five hundred scientists, the chances of us getting all of those queries right are small.

\n

What we need to do is tell the database manager to aggregate the readings for each scientist separately using a GROUP BY clause:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-29", "source": [ "%%sql\n", "SELECT person, count(reading), round(avg(reading), 2)\n", "FROM Survey\n", "WHERE quant = 'rad'\n", "GROUP BY person;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-30", "source": "

GROUP BY does exactly what its name implies:\ngroups all the records with the same value for the specified field together\nso that aggregation can process each batch separately.\nSince all the records in each batch have the same value for person,\nit no longer matters that the database manager\nis picking an arbitrary one to display\nalongside the aggregated reading values.

\n
\n
💡 Tip: Know Excel? It's just a pivot table.
\n

For Excel users, GROUP BY is essentially a pivot table: it lets you build nice summary tables that aggregate your results.

\n

And if you didn’t already know the Excel equivalent, now you know what to\nlook for when you need it!

\n
\n

Just as we can sort by multiple criteria at once,\nwe can also group by multiple criteria.\nTo get the average reading by scientist and quantity measured,\nfor example,\nwe just add another field to the GROUP BY clause:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-31", "source": [ "%%sql\n", "SELECT person, quant, count(reading), round(avg(reading), 2)\n", "FROM Survey\n", "GROUP BY person, quant;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-32", "source": "

Note that we have added quant to the list of fields displayed,\nsince the results wouldn’t make much sense otherwise.

\n

Let’s go one step further and remove all the entries\nwhere we don’t know who took the measurement:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-33", "source": [ "%%sql\n", "SELECT person, quant, count(reading), round(avg(reading), 2)\n", "FROM Survey\n", "WHERE person IS NOT NULL\n", "GROUP BY person, quant\n", "ORDER BY person, quant;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-34", "source": "

Looking more closely,\nthis query:

\n
1. selected records from the Survey table where the person field was not null;
2. grouped those records into subsets so that the person and quant values in each subset were the same;
3. ordered those subsets first by person, and then within each sub-group by quant; and
4. counted the number of records in each subset, calculated the average reading in each, and chose a person and quant value from each (it doesn’t matter which ones, since they’re all equal).
\n
\n
❓ Question: Counting Temperature Readings
\n

How many temperature readings did Frank Pabodie record,\nand what was their average value?

\n
👁 View solution\n
👁 Solution
\n
SELECT count(reading), avg(reading) FROM Survey WHERE quant = 'temp' AND person = 'pb';\n
| count(reading) | avg(reading) |
| -------------- | ------------ |
| 2              | -20.0        |
\n
\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-35", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-36", "source": "
\n

❓ Question: Averaging with NULL

\n

The average of a set of values is the sum of the values\ndivided by the number of values.\nDoes this mean that the avg function returns 2.0 or 3.0\nwhen given the values 1.0, null, and 5.0?

\n
👁 View solution\n
👁 Solution
\n

The answer is 3.0.\nNULL is not a value; it is the absence of a value.\nAs such it is not included in the calculation.

\n

You can confirm this by executing this code:

\n
SELECT AVG(a) FROM (\n    SELECT 1 AS a\n    UNION ALL SELECT NULL\n    UNION ALL SELECT 5);\n
\n
\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-37", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-38", "source": "
\n
❓ Question: What Does This Query Do?
\n

We want to calculate the difference between\neach individual radiation reading\nand the average of all the radiation readings.\nWe write the query:

\n
SELECT reading - avg(reading) FROM Survey WHERE quant = 'rad';\n
\n

What does this actually produce, and can you think of why?

\n
👁 View solution\n
👁 Solution
\n

The query produces only one row of results, when what we really want is a result for each of the readings. The avg() function produces only a single value, and because it is run first, the table is reduced to a single row. The reading value is simply an arbitrary one.

\n

To achieve what we wanted, we would have to run two queries:

\n
SELECT avg(reading) FROM Survey WHERE quant='rad';\n
\n

This produces the average value (6.5625), which we can then insert into a second query:

\n
SELECT reading - 6.5625 FROM Survey WHERE quant = 'rad';\n
\n

This produces what we want, but we can combine this into a single query using subqueries.

\n
SELECT reading - (SELECT avg(reading) FROM Survey WHERE quant='rad') FROM Survey WHERE quant = 'rad';\n
\n

This way we don’t have to execute two queries.

\n

In summary what we have done is to replace avg(reading) with (SELECT avg(reading) FROM Survey WHERE quant='rad') in the original query.

\n
\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-39", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-40", "source": "
\n
❓ Question: Ordering When Concatenating
\n

The function group_concat(field, separator)\nconcatenates all the values in a field\nusing the specified separator character\n(or ‘,’ if the separator isn’t specified).\nUse this to produce a one-line list of scientists’ names,\nsuch as:

\n
William Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth\n
\n

Can you find a way to order the list by surname?

\n
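As a starting point (this is not the full answer to the ordering part), a basic call might look like the sketch below; it assumes the personal and family columns of the Person table and uses SQLite's || operator to build each full name.
\n
SELECT group_concat(personal || ' ' || family, ', ') FROM Person;\n
\n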
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-41", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-42", "source": "

# Combining Data

\n

In order to submit our data to a web site\nthat aggregates historical meteorological data,\nwe might need to format it as\nlatitude, longitude, date, quantity, and reading.\nHowever,\nour latitudes and longitudes are in the Site table,\nwhile the dates of measurements are in the Visited table\nand the readings themselves are in the Survey table.\nWe need to combine these tables somehow.

\n

This figure shows the relations between the tables:

\n

\"Survey

\n

The SQL command to do this is JOIN.\nTo see how it works,\nlet’s start by joining the Site and Visited tables:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-43", "source": [ "%%sql\n", "SELECT * FROM Site JOIN Visited;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-44", "source": "

JOIN creates the cross product of two tables, i.e., it joins each record of one table with each record of the other table to give all possible combinations. Since there are three records in Site and eight in Visited, the join’s output has 24 records (3 * 8 = 24). And since each table has three fields, the output has six fields (3 + 3 = 6).

\n

What the join hasn’t done is\nfigure out if the records being joined have anything to do with each other.\nIt has no way of knowing whether they do or not until we tell it how.\nTo do that,\nwe add a clause specifying that\nwe’re only interested in combinations that have the same site name,\nthus we need to use a filter:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-45", "source": [ "%%sql\n", "SELECT * FROM Site JOIN Visited ON Site.name = Visited.site;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-46", "source": "

ON is very similar to WHERE, and for all the queries in this lesson you can use them interchangeably. There are differences in how they affect outer joins, but that’s beyond the scope of this lesson. Once we add this to our query, the database manager throws away records that combined information about two different sites, leaving us with just the ones we want.

\n
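For example, the previous join could equally be written with a WHERE clause, and for the inner joins in this lesson the result is the same:
\n
SELECT * FROM Site JOIN Visited WHERE Site.name = Visited.site;\n
\n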

Notice that we used Table.field to specify field names\nin the output of the join.\nWe do this because tables can have fields with the same name,\nand we need to be specific which ones we’re talking about.\nFor example,\nif we joined the Person and Visited tables,\nthe result would inherit a field called id\nfrom each of the original tables.

\n

We can now use the same dotted notation\nto select the three columns we actually want\nout of our join:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-47", "source": [ "%%sql\n", "SELECT Site.lat, Site.long, Visited.dated\n", "FROM Site JOIN Visited\n", "ON Site.name = Visited.site;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-48", "source": "

If joining two tables is good,\njoining many tables must be better.\nIn fact,\nwe can join any number of tables\nsimply by adding more JOIN clauses to our query,\nand more ON tests to filter out combinations of records\nthat don’t make sense:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-49", "source": [ "%%sql\n", "SELECT Site.lat, Site.long, Visited.dated, Survey.quant, Survey.reading\n", "FROM Site JOIN Visited JOIN Survey\n", "ON Site.name = Visited.site\n", "AND Visited.id = Survey.taken\n", "AND Visited.dated IS NOT NULL;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-50", "source": "

We can tell which records from Site, Visited, and Survey\ncorrespond with each other\nbecause those tables contain\nprimary keys\nand foreign keys.\nA primary key is a value,\nor combination of values,\nthat uniquely identifies each record in a table.\nA foreign key is a value (or combination of values) from one table\nthat identifies a unique record in another table.\nAnother way of saying this is that\na foreign key is the primary key of one table\nthat appears in some other table.\nIn our database,\nPerson.id is the primary key in the Person table,\nwhile Survey.person is a foreign key\nrelating the Survey table’s entries\nto entries in Person.

\n
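For example, because Survey.person stores values of Person.id, we can follow that foreign key to attach scientists’ names to their measurements (rows where person is null simply find no match):
\n
SELECT Person.personal, Person.family, Survey.quant, Survey.reading\nFROM Person JOIN Survey ON Person.id = Survey.person;\n
\n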

Most database designers believe that every table should have a well-defined primary key. They also believe that this key should be separate from the data itself, so that if we ever need to change the data, we only need to make one change in one place. One easy way to do this is to create an arbitrary, unique ID for each record as we add it to the database. This is actually very common: those IDs have names like “student numbers” and “patient numbers”, and they almost always turn out to have originally been a unique record identifier in some database system or other. As the query below demonstrates, SQLite automatically numbers records as they’re added to tables, and we can use those record numbers in queries:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-51", "source": [ "%%sql\n", "SELECT rowid, * FROM Person;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-52", "source": "
\n
❓ Question: Listing Radiation Readings
\n

Write a query that lists all radiation readings from the DR-1 site.

\n
👁 View solution\n
👁 Solution
\n
SELECT Survey.reading\nFROM Site JOIN Visited JOIN Survey\nON Site.name = Visited.site\nAND Visited.id = Survey.taken\nWHERE Site.name = 'DR-1'\nAND Survey.quant = 'rad';\n
| reading |
| ------- |
| 9.82    |
| 7.8     |
| 11.25   |
\n
\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-53", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-54", "source": "
\n
❓ Question: Where's Frank?
\n

Write a query that lists all sites visited by people named “Frank”.

\n
👁 View solution\n
👁 Solution
\n
SELECT DISTINCT Site.name\nFROM Site JOIN Visited JOIN Survey JOIN Person\nON Site.name = Visited.site\nAND Visited.id = Survey.taken\nAND Survey.person = Person.id\nWHERE Person.personal = 'Frank';\n
| name |
| ---- |
| DR-3 |
\n
\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-55", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-56", "source": "
\n
❓ Question: Reading Queries
\n

Describe in your own words what the following query produces:

\n
SELECT Site.name FROM Site JOIN Visited\nON Site.lat < -49.0 AND Site.name = Visited.site AND Visited.dated >= '1932-01-01';\n
\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-57", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-58", "source": "
\n
❓ Question: Who Has Been Where?
\n

Write a query that shows each site with exact location (lat, long) ordered by visited date,\nfollowed by personal name and family name of the person who visited the site\nand the type of measurement taken and its reading. Please avoid all null values.\nTip: you should get 15 records with 8 fields.

\n
👁 View solution\n
👁 Solution
\n
SELECT Site.name, Site.lat, Site.long, Person.personal, Person.family, Survey.quant, Survey.reading, Visited.dated\nFROM Site JOIN Visited JOIN Survey JOIN Person\nON Site.name = Visited.site\nAND Visited.id = Survey.taken\nAND Survey.person = Person.id\nWHERE Survey.person IS NOT NULL\nAND Visited.dated IS NOT NULL\nORDER BY Visited.dated;\n
| name  | lat    | long    | personal  | family  | quant | reading | dated      |
| ----- | ------ | ------- | --------- | ------- | ----- | ------- | ---------- |
| DR-1  | -49.85 | -128.57 | William   | Dyer    | rad   | 9.82    | 1927-02-08 |
| DR-1  | -49.85 | -128.57 | William   | Dyer    | sal   | 0.13    | 1927-02-08 |
| DR-1  | -49.85 | -128.57 | William   | Dyer    | rad   | 7.8     | 1927-02-10 |
| DR-1  | -49.85 | -128.57 | William   | Dyer    | sal   | 0.09    | 1927-02-10 |
| DR-3  | -47.15 | -126.72 | Anderson  | Lake    | sal   | 0.05    | 1930-01-07 |
| DR-3  | -47.15 | -126.72 | Frank     | Pabodie | rad   | 8.41    | 1930-01-07 |
| DR-3  | -47.15 | -126.72 | Frank     | Pabodie | temp  | -21.5   | 1930-01-07 |
| DR-3  | -47.15 | -126.72 | Frank     | Pabodie | rad   | 7.22    | 1930-01-12 |
| DR-3  | -47.15 | -126.72 | Anderson  | Lake    | sal   | 0.1     | 1930-02-26 |
| DR-3  | -47.15 | -126.72 | Frank     | Pabodie | rad   | 4.35    | 1930-02-26 |
| DR-3  | -47.15 | -126.72 | Frank     | Pabodie | temp  | -18.5   | 1930-02-26 |
| MSK-4 | -48.87 | -123.4  | Anderson  | Lake    | rad   | 1.46    | 1932-01-14 |
| MSK-4 | -48.87 | -123.4  | Anderson  | Lake    | sal   | 0.21    | 1932-01-14 |
| MSK-4 | -48.87 | -123.4  | Valentina | Roerich | sal   | 22.5    | 1932-01-14 |
| DR-1  | -49.85 | -128.57 | Valentina | Roerich | rad   | 11.25   | 1932-03-22 |
\n
\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-59", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-60", "source": "

A good visual explanation of joins can be found in the SQL Join Visualizer

\n

# Data Hygiene

\n

Now that we have seen how joins work, we can see why the relational\nmodel is so useful and how best to use it. The first rule is that\nevery value should be atomic, i.e., not\ncontain parts that we might want to work with separately. We store\npersonal and family names in separate columns instead of putting the\nentire name in one column so that we don’t have to use substring\noperations to get the name’s components. More importantly, we store\nthe two parts of the name separately because splitting on spaces is\nunreliable: just think of a name like “Eloise St. Cyr” or “Jan Mikkel\nSteubart”.

\n
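As a small illustration of why splitting on spaces is unreliable, the sketch below splits on the first space using SQLite’s instr and substr functions and reports 'Mikkel Steubart' as the family name, which is wrong if the personal name is 'Jan Mikkel':
\n
SELECT substr('Jan Mikkel Steubart', instr('Jan Mikkel Steubart', ' ') + 1) AS family;\n
\n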

The second rule is that every record should have a unique primary key.\nThis can be a serial number that has no intrinsic meaning,\none of the values in the record (like the id field in the Person table),\nor even a combination of values:\nthe triple (taken, person, quant) from the Survey table uniquely identifies every measurement.

\n
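If you want to check a candidate key like this, a query along the following lines looks for duplicated combinations (HAVING, not otherwise used in this lesson, filters groups after aggregation); an empty result means the combination is unique in the current data:
\n
SELECT taken, person, quant, count(*)\nFROM Survey\nGROUP BY taken, person, quant\nHAVING count(*) > 1;\n
\n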

The third rule is that there should be no redundant information.\nFor example,\nwe could get rid of the Site table and rewrite the Visited table like this:

| id  | lat    | long    | dated      |
| --- | ------ | ------- | ---------- |
| 619 | -49.85 | -128.57 | 1927-02-08 |
| 622 | -49.85 | -128.57 | 1927-02-10 |
| 734 | -47.15 | -126.72 | 1930-01-07 |
| 735 | -47.15 | -126.72 | 1930-01-12 |
| 751 | -47.15 | -126.72 | 1930-02-26 |
| 752 | -47.15 | -126.72 | None       |
| 837 | -48.87 | -123.40 | 1932-01-14 |
| 844 | -49.85 | -128.57 | 1932-03-22 |
\n

In fact, we could use a single table that recorded all the information about each reading in each row, just as a spreadsheet would. The problem is that it’s very hard to keep data that is organized this way consistent: if we realize that the date of a particular visit to a particular site is wrong, we have to change multiple records in the database. What’s worse, we may have to guess which records to change, since other sites may also have been visited on that date.

\n

The fourth rule is that the units for every value should be stored explicitly.\nOur database doesn’t do this,\nand that’s a problem:\nRoerich’s salinity measurements are several orders of magnitude larger than anyone else’s,\nbut we don’t know if that means she was using parts per million instead of parts per thousand,\nor whether there actually was a saline anomaly at that site in 1932.

\n

Stepping back,\ndata and the tools used to store it have a symbiotic relationship:\nwe use tables and joins because it’s efficient,\nprovided our data is organized a certain way,\nbut organize our data that way because we have tools to manipulate it efficiently.\nAs anthropologists say,\nthe tool shapes the hand that shapes the tool.

\n
\n
❓ Question: Identifying Atomic Values
\n

Which of the following are atomic values? Which are not? Why?

- New Zealand
- 87 Turing Avenue
- January 25, 1971
- the XY coordinate (0.5, 3.3)
\n
👁 View solution\n
👁 Solution
\n

New Zealand is the only clear-cut atomic value.

\n

The address and the XY coordinate contain more than one piece of information\nwhich should be stored separately:

- house number and street name
- X coordinate and Y coordinate
\n

The date entry is less clear cut, because it contains month, day, and year elements. However, there is a DATE datatype in SQL, and dates should be stored using this format. If we need to work with the month, day, or year separately, we can use the SQL functions available for our database software (for example EXTRACT in many databases, or STRFTIME in SQLite).

\n
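For example, in SQLite the pieces of a text date can be pulled out with strftime:
\n
SELECT strftime('%Y', '1930-01-12') AS year,\n       strftime('%m', '1930-01-12') AS month,\n       strftime('%d', '1930-01-12') AS day;\n
\n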
\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-61", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-62", "source": "
\n
❓ Question: Identifying a Primary Key
\n

What is the primary key in this table?\nI.e., what value or combination of values uniquely identifies a record?

| latitude | longitude | date       | temperature |
| -------- | --------- | ---------- | ----------- |
| 57.3     | -22.5     | 2015-01-09 | -14.2       |
\n
👁 View solution\n
👁 Solution
\n

Latitude, longitude, and date are all required to uniquely identify the temperature record.

\n
\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-63", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-64", "source": "

# Creating and Modifying Data

\n

So far we have only looked at how to get information out of a database,\nboth because that is more frequent than adding information,\nand because most other operations only make sense\nonce queries are understood.\nIf we want to create and modify data,\nwe need to know two other sets of commands.

\n

The first pair are CREATE TABLE and DROP TABLE. While they are written as two words, they are actually single commands. The first one creates a new table; its arguments are the names and types of the table’s columns. For example, the following statements create the four tables in our survey database:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-65", "source": [ "%%sql\n", "CREATE TABLE Person(id text, personal text, family text);\n", "CREATE TABLE Site(name text, lat real, long real);\n", "CREATE TABLE Visited(id integer, site text, dated text);\n", "CREATE TABLE Survey(taken integer, person text, quant text, reading real);" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-66", "source": "

We can get rid of one of our tables using:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-67", "source": [ "%%sql\n", "DROP TABLE Survey;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-68", "source": "

Be very careful when doing this:\nif you drop the wrong table, hope that the person maintaining the database has a backup,\nbut it’s better not to have to rely on it.

\n

Different database systems support different data types for table columns,\nbut most provide the following:

| data type | use                                       |
| --------- | ----------------------------------------- |
| INTEGER   | a signed integer                          |
| REAL      | a floating point number                   |
| TEXT      | a character string                        |
| BLOB      | a “binary large object”, such as an image |
\n

Most databases also support Booleans and date/time values;\nSQLite uses the integers 0 and 1 for the former,\nand represents the latter as text or numeric fields.

\n
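For example, SQLite’s date and time functions work directly on dates stored as ISO-formatted text, so we can compute the number of days between two of the visit dates:
\n
SELECT julianday('1930-02-26') - julianday('1930-01-07') AS days_elapsed;  -- 50.0\n
\n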

An increasing number of databases also support geographic data types,\nsuch as latitude and longitude.\nKeeping track of what particular systems do or do not offer,\nand what names they give different data types,\nis an unending portability headache.

\n
\n
💡 Tip: Which database should I use?
\n

SQLite is fantastic for small databases or embedded into applications where\nyou want to be able to use SQL to query and process data.

\n

However, for any real analysis PostgreSQL is usually the best choice: it scales incredibly well, can meet a wide range of use cases, and has good data type support.

\n
\n
\n
💡 Tip: Do you have geographic data?
\n

Use Postgres. The PostGIS extension is fantastic and the industry standard for storing geographic data in a database.

\n
\n

When we create a table,\nwe can specify several kinds of constraints on its columns.\nFor example,\na better definition for the Survey table would be:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-69", "source": [ "%%sql\n", "CREATE TABLE Survey(\n", " taken integer not null, -- where reading taken\n", " person text, -- may not know who took it\n", " quant text not null, -- the quantity measured\n", " reading real not null, -- the actual reading\n", " primary key(taken, quant),\n", " foreign key(taken) references Visited(id),\n", " foreign key(person) references Person(id)\n", ");" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-70", "source": "

Once again,\nexactly what constraints are available\nand what they’re called\ndepends on which database manager we are using.

\n

Once tables have been created,\nwe can add, change, and remove records using our other set of commands,\nINSERT, UPDATE, and DELETE.

\n

Here is an example of inserting rows into the Site table:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-71", "source": [ "%%sql\n", "INSERT INTO Site (name, lat, long) VALUES ('DR-1', -49.85, -128.57);\n", "INSERT INTO Site (name, lat, long) VALUES ('DR-3', -47.15, -126.72);\n", "INSERT INTO Site (name, lat, long) VALUES ('MSK-4', -48.87, -123.40);" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-72", "source": "

We can also insert values into one table directly from another:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-73", "source": [ "%%sql\n", "CREATE TABLE JustLatLong(lat real, long real);\n", "INSERT INTO JustLatLong SELECT lat, long FROM Site;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-74", "source": "

Modifying existing records is done using the UPDATE statement.\nTo do this we tell the database which table we want to update,\nwhat we want to change the values to for any or all of the fields,\nand under what conditions we should update the values.

\n

For example, if we made a mistake when entering the lat and long values\nof the last INSERT statement above, we can correct it with an update:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-75", "source": [ "%%sql\n", "UPDATE Site SET lat = -47.87, long = -122.40 WHERE name = 'MSK-4';" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-76", "source": "

Be careful not to forget the WHERE clause, or the UPDATE statement will modify all of the records in the table.

\n
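One way to protect yourself is to run a SELECT with the same WHERE clause first, so you can see exactly which rows the UPDATE will touch before changing anything:
\n
SELECT * FROM Site WHERE name = 'MSK-4';\n-- only after checking the result:\nUPDATE Site SET lat = -47.87, long = -122.40 WHERE name = 'MSK-4';\n
\n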

Deleting records can be a bit trickier,\nbecause we have to ensure that the database remains internally consistent.\nIf all we care about is a single table,\nwe can use the DELETE command with a WHERE clause\nthat matches the records we want to discard.\nFor example,\nonce we realize that Frank Danforth didn’t take any measurements,\nwe can remove him from the Person table like this:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-77", "source": [ "%%sql\n", "DELETE FROM Person WHERE id = 'danforth';" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-78", "source": "

But what if we removed Anderson Lake instead?\nOur Survey table would still contain seven records\nof measurements he’d taken,\nbut that’s never supposed to happen:\nSurvey.person is a foreign key into the Person table,\nand all our queries assume there will be a row in the latter\nmatching every value in the former.

\n

This problem is called referential integrity:\nwe need to ensure that all references between tables can always be resolved correctly.\nOne way to do this is to delete all the records\nthat use 'lake' as a foreign key\nbefore deleting the record that uses it as a primary key.\nIf our database manager supports it,\nwe can automate this\nusing cascading delete.\nHowever,\nthis technique is outside the scope of this chapter.

\n
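Without cascading deletes, the manual version of that order of operations would look something like this: delete the dependent Survey rows first, then the Person row.
\n
DELETE FROM Survey WHERE person = 'lake';\nDELETE FROM Person WHERE id = 'lake';\n
\n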
\n
💡 Tip: Hybrid Storage Models
\n

Many applications use a hybrid storage model\ninstead of putting everything into a database:\nthe actual data (such as astronomical images) is stored in files,\nwhile the database stores the files’ names,\ntheir modification dates,\nthe region of the sky they cover,\ntheir spectral characteristics,\nand so on.\nThis is also how most music player software is built:\nthe database inside the application keeps track of the MP3 files,\nbut the files themselves live on disk.

\n
\n
\n
❓ Question: Replacing NULL
\n

Write an SQL statement to replace all uses of null in\nSurvey.person with the string 'unknown'.

\n
👁 View solution\n
👁 Solution
\n
UPDATE Survey SET person = 'unknown' WHERE person IS NULL;\n
\n
\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-79", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-80", "source": "
\n
❓ Question: Backing Up with SQL
\n

SQLite has several administrative commands that aren’t part of the\nSQL standard. One of them is .dump, which prints the SQL commands\nneeded to re-create the database. Another is .read, which reads a\nfile created by .dump and restores the database. A colleague of\nyours thinks that storing dump files (which are text) in version\ncontrol is a good way to track and manage changes to the database.\nWhat are the pros and cons of this approach? (Hint: records aren’t\nstored in any particular order.)

\n
👁 View solution\n
👁 Solution
\n

Advantages

\n\n

Disadvantages

\n\n
\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-81", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-82", "source": "", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "cell_type": "markdown", "id": "final-ending-cell", "metadata": { "editable": false, "collapsed": false }, "source": [ "# Key Points\n\n", "- Use aggregation functions to combine multiple values.\n", "- Aggregation functions ignore `null` values.\n", "- Aggregation happens after filtering.\n", "- Use GROUP BY to combine subsets separately.\n", "- If no aggregation function is specified for a field, the query may return an arbitrary value for that field.\n", "- Use JOIN to combine data from two tables.\n", "- Use table.field notation to refer to fields when doing joins.\n", "- Every fact should be represented in a database exactly once.\n", "- A join produces all combinations of records from one table with records from another.\n", "- A primary key is a field (or set of fields) whose values uniquely identify the records in a table.\n", "- A foreign key is a field (or set of fields) in one table whose values are a primary key in another table.\n", "- We can eliminate meaningless combinations of records by matching primary keys and foreign keys between tables.\n", "- The most common join condition is matching keys.\n", "- Every value in a database should be atomic.\n", "- Every record should have a unique primary key.\n", "- A database should not contain redundant information.\n", "- Units and similar metadata should be stored with the data.\n", "- Use CREATE and DROP to create and delete tables.\n", "- Use INSERT to add data.\n", "- Use UPDATE to modify existing data.\n", "- Use DELETE to remove data.\n", "- It is simpler and safer to modify data when every record has a unique primary key.\n", "- Do not create dangling references by deleting records that other records refer to.\n", "- General-purpose languages have libraries for accessing databases.\n", "- To connect to a database, a program must use a library specific to that database manager.\n", "- These libraries use a connection-and-cursor model.\n", "- Programs can read query results in batches or all at once.\n", "- Queries should be written using parameter substitution, not string formatting.\n", "\n# Congratulations on successfully completing this tutorial!\n\n", "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/sql-advanced/tutorial.html#feedback) and check there for further resources!\n" ] } ] }