Relational Databases for Biologists

Session 2: Mining a database with SQL

Exercise: Learn how to query a MySQL database to extract selected data relationships. Learn how to convert a question (in English) into a SQL query and vice versa.

Goals: Learn how to access a MySQL database and look at table structure and content. Do this by accessing a database on hebrides.wi.mit.edu, a remote Linux computer, from your laptop.

See /education/bioinfo2006/db4bio/ for the course page.

See the solutions to this exercise to check your work.

To connect to MySQL on hebrides, you have two choices:

a. You can connect directly from your laptop or desktop computer to MySQL on hebrides without logging on to hebrides. To do this, you need MySQL on your computer, which is the case for the class laptops. If you need to do this on your own computer, go to the most recent MySQL download page, select your operating system, download, and install. This will get you the MySQL database management system (DBMS), which has the 'mysql' command to connect you to another computer (or you can create a database on your own computer).

b. You can first log on to hebrides and then connect to MySQL. To do this, you need to have or install a ssh (secure shell) application. See Connecting to tak or hebrides or (if using Windows) install Cygwin as described on Using Cygwin/X for X Windows on Tak or Hebrides.

Note that you can install and use MySQL on your own computer. We have chosen to use MySQL on a Linux computer so everyone in the class can access the same database. If you ever use MySQL on your own computer, the MySQL queries would work just the same (but be perhaps somewhat slower).


Connect to MySQL on hebrides using either step 1a (recommended) or 1b, depending upon if you're doing step (a) or (b) above.

1a. Get to the Unix prompt (Mac OS X) or Unix-like prompt (Windows) on your laptop. If you're using Mac OS X, open the X11 terminal (Applications > Utilities > X11), which is also on the dock. If you're using Windows, open Cygwin, which is a "Unix emulator", meaning that it makes your Windows computer act as if it's a Unix computer.

In either case, you now have a terminal, essentially a place to send commands to your computer, including the possibility of connnecting to a remote computer.

Change directories ("cd", a Unix command) to go to the desktop:

cd "$USERPROFILE/Desktop"   (for Windows)
cd Desktop   (for Macintosh)

1b. From the terminal on your laptop, connect to Hebrides using the secure shell (ssh) command:

ssh -l username hebrides.wi.mit.edu

and you'll be prompted for your password.

3. To access MySQL on Hebrides, use this command to connect to the db4bio database:

mysql -u username -h hebrides.wi.mit.edu -D database -p
with the following options:

-u username (replace with your MySQL username [not necessarily your hebrides username])

-h host (where the "host" is hebrides.wi.mit.edu; not necessary if you're already connected to hebrides)

-D database (which takes you directly to the database we'll be using ("db4bio"); optional)

-p password (for which you will be prompted)

ex:   mysql -u guest99 -h hebrides.wi.mit.edu -D db4bio -p


Try some queries in increasing level of complexity, starting with some easy examples to show correct syntax.

Note: MySQL keywords (SELECT, FROM, etc.), data fields, and search strings are case-insensitive but other words (table names) ARE case sensitive.

SQL querying can be done any one of several ways:

  1. Type the SQL at the MySQL prompt.
  2. Copy the SQL and paste it into the MySQL prompt.
  3. Create a text file containing the SQL (called, for example, "query_1.sql") in your working directory. From the MySQL prompt, type
    mysql> source query_1.sql;
  4. Send the SQL to MySQL from the Unix prompt (without being logged in to MySQL):
    mysql -u username -h hebrides.wi.mit.edu -D db4bio -p < query_1.sql

Convert each of the following SQL queries into a question, and convert each of the following questions into SQL.

They are listed in order of increasing difficulty, and some are quite difficult.

You may wish to consult the answers.

1.

SELECT * from Sources;

2. What are some example GenBank IDs (accessions) and corresponding sequence descriptions?

3.

SELECT count(*) FROM LocusLinks;
(to get number of tuples in table)

4. How many different Affy IDs are in the expression data?

5.

SELECT description, species FROM LocusDescr
WHERE linkId = 655;
Note that 655 is of type integer (int), so it doesn't need quotes.

6. What's the expression level of Affy ID 33659_at in experiment hs-liv-1?

7.

SELECT * FROM LocusDescr WHERE description LIKE "phosphatase%";
Note that the "%" (wildcard) matches any character(s).

8. What are all the gene descriptions, along with their GenBank IDs, containing the phrase "growth factor"?

9.

SELECT * from Data WHERE exptId = "mm-hrt-1" and level > 6000
ORDER BY level DESC;

10. What Gene Ontology terms (and corresponding accessions) contain the phrase "transcription factor"? Put your answer in alphabetical order of terms.

11.

SELECT Data.affyId, Targets.gbId, Data.level
FROM Data, Targets
WHERE Data.affyId = Targets.affyId
ORDER BY Data.level DESC
LIMIT 10;

12. What Affy probes correspond to target sequences with the phrase "interleukin" in their description?

13.

SELECT Data.affyId, Targets.gbId, UniSeqs.uId, Data.level
FROM Data, Targets, UniSeqs
WHERE Data.affyId = Targets.affyId
AND Targets.gbId = UniSeqs.gbId
ORDER BY level DESC
LIMIT 10;

14. Make a table of ten affyId, UnigeneID, and Unigene descriptions in reverse alphabetical order of Unigene descriptions.

15.

SELECT affyId, AVG(level) AS average FROM Data
GROUP BY affyId
ORDER BY level DESC
LIMIT 10;
Note that AVG() is an aggregate [like COUNT(), MIN(), MAX(), and SUM()] and "AS" creates a field with the requested name. Since you're combining rows, you have to use GROUP to show how to do it.

16. What is the average expression level of each of the six experiments?

17.

SELECT Data.affyId, Data.level, Data.exptId, DataCopy.affyId, DataCopy.level, DataCopy.exptId
FROM Data, Data DataCopy
WHERE Data.level > 10 * DataCopy.level
AND Data.affyId=DataCopy.affyId
AND Data.affyId NOT LIKE "AFFX%"
LIMIT 10;
(To compare rows of the same table, we copy the table and compare the table to its copy.)

18. List ten non-control human Affy IDs (IDs without "AFFX") with higher expression in liver than in heart.

19.

SELECT Data.affyId,

Data.level AS Heart_level, DataCopy.level AS Brain_level,
Data.level - DataCopy.level AS Difference
FROM Data, Data DataCopy
WHERE Data.affyId= DataCopy.affyId
AND Data.exptId = "hs-hrt-1"
AND DataCopy.exptId = "hs-cer-1"
ORDER BY Difference DESC
LIMIT 10;

20.

SELECT Data.affyId, Targets.gbId, LocusLinks.
linkId, LocusDescr.description,
Data.level as Heart_level,
DataCopy.level as Brain_level,
Data.level - DataCopy.level AS Difference
FROM Data, Data DataCopy, LocusLinks, LocusDescr, Targets
WHERE Data.affyId= DataCopy.affyId
AND Data.exptId = "hs-hrt-1"
AND DataCopy.exptId = "hs-cer-1"
AND Data.affyId = Targets.affyId
AND Targets.gbId = LocusLinks.gbId
AND LocusLinks.linkId = LocusDescr.linkId
ORDER BY Difference DESC
LIMIT 10;


Questions or comments?   gbell@wi.mit.edu
Bioinformatics and Research Computing at Whitehead Institute