Relational Databases
for Biologists
Session 2: SQL to data mine a database
Exercises
Goals: Learn how to query a mySQL database to
extract selected data relationships. Learn how to convert a question (in
English) into a SQL query and vice versa.
See /education/bioinfo2005/db4bio/
for the course page.
See http://www.mysql.com/documentation/
for
mySQL documentation.
See http://www.mysql.com/doc/en/Tutorial.html
for
an SQL tutorial.
Log in to the
Òdb4bioÓ mySQL database on iona:
mysql
–u username –D db4bio –p
with the
following options:
-h
host (the computer with the mySQL database) = hebrides.wi.mit.edu
-u
username (replace with your mySQL username)
-D
database (Òdb4bioÓ; optional – you can select the database later)
-p
password (youÕll be prompted for it)
Try some queries
in increasing level of complexity, starting with some easy examples to show
correct syntax. Note: mySQL keywords (SELECT, FROM, etc.), data fields, and
search strings are case-insensitive but other words (table names) ARE case
sensitive.
A query can
always be saved as a file, e.g., myquery.sql. Then run it from the mySQL prompt
by typing
SOURCE
myquery.sql
Convert each
of the following SQL queries into a question, and convert each of the following
questions into SQL. They are listed in order of increasing difficulty, and some
are quite difficult.
1. SELECT
* from Sources;
Get
every attribute from each tuple of the Sources table.
2. What are some
example GenBank IDs (accessions) and corresponding sequence descriptions?
SELECT
* FROM Descriptions LIMIT 5;
3. SELECT
count(*) FROM LocusLinks; (to get number of tuples in table)
Count
= 70153
4. How many
different Affy IDs are in the expression data?
SELECT
COUNT(affyId) FROM Data;
5. SELECT
description, species FROM LocusDescr
WHERE
linkId = 655;
Note
that 655 is type integer (int), so it doesnÕt need quotes.
+------------------------------+---------+
| description
| species |
+------------------------------+---------+
| bone morphogenetic protein 7 | Hs |
+-----------------------------------------
1 row in set (0.04 sec)
6. WhatÕs the
expression level of Affy ID 33659_at in experiment hs-liv-1?
SELECT
level FROM Data WHERE affyId=Õ33659_atÕ
AND
exptId=Õhs-liv-1Õ;
7. SELECT
* FROM LocusDescr WHERE description LIKE Òphosphatase%Ó;
Note that the
Ò%Ó (wildcard) matches any character(s).
+--------+-------------+---------+
| linkId | description species |
+--------+-------------+---------+
6 rows returned
8. What are all
the gene descriptions, along with their GenBank IDs, containing the phrase
Ògrowth factorÓ?
SELECT
* FROM Descriptions
WHERE
description LIKE Ò%growth factor%Ó;
9. SELECT
* from Data WHERE exptId = Òmm-hrt-1Ó and level > 6000
ORDER
BY level DESC;
+-----------------------------+----------+-------+
| affyId
|
exptId | level |
+-----------------------------+----------+-------+
73 rows selected
10. What Gene
Ontology terms (and corresponding accessions) contain the phrase Òtranscription
factorÓ? Put your answer in alphabetical order of terms.
SELECT
* FROM GO_Descr
WHERE
description LIKE Ò%transcription factor%Ó
ORDER
BY description;
11. SELECT
Data.affyId, Targets.gbId, Data.level
FROM
Data, Targets
WHERE
Data.affyId = Targets.affyId
ORDER
BY Data.level DESC
LIMIT
10;
+------------+----------+-------+
| affyId | gbId | level |
+------------+----------+-------+
| 36780_at | M25915
| 47914 |
| 39106_at | X02162
| 42001 |
| 33377_at | X03168
| 41726 |
| 36781_at | X01683
| 40854 |
| 35083_at | AL031670 | 38720 |
| 32252_at | D00096
| 35603 |
| 36621_at | M16961
| 34556 |
| 36984_f_at | X89214 | 34311 |
| 39775_at | X54486
| 33765 |
| 36995_at | M88249
| 33553 |
+------------+----------+-------+
10 rows in set (0.52 sec)
12. What Affy
probes correspond to target sequences with the phrase ÒinterleukinÓ in their
description?
SELECT
affyId FROM Targets, Descriptions
WHERE
Targets.gbId=Descriptions.gbId
AND
Descriptions.description LIKE Ò%interleukin%Ó;
13. SELECT
Data.affyId, Targets.gbId, UniSeqs.uId, Data.level
FROM
Data, Targets, UniSeqs
WHERE
Data.affyId = Targets.affyId
AND
Targets.gbId = UniSeqs.gbId
ORDER
BY level DESC LIMIT 10;
+------------+--------+-----------+-------+
| affyId | gbId | uId | level |
+------------+--------+-----------+-------+
| 36780_at | M25915 | Hs.75106 | 47914 |
| 39106_at | X02162 | Hs.93194 | 42001 |
| 33377_at | X03168 | Hs.2257 | 41726 |
| 36781_at | X01683 | Hs.297681 | 40854 |
| 32252_at | D00096 | Hs.194366 | 35603 |
| 36621_at | M16961 | Hs.324746 | 34556 |
| 36984_f_at | X89214 | Hs.328822 | 34311 |
| 93109_f_at | M75718 | Mm.89843 | 33112 |
| 33455_at | X02747 | Hs.234234 | 31473 |
| 35905_s_at | U34995 | Hs.169476 | 31109 |
+------------+--------+-----------+-------+
10 rows in set (0.52 sec)
14. Make a table
of ten affyId, UnigeneID, and Unigene descriptions in reverse alphabetical
order of Unigene descriptions.
SELECT
Targets.affyid, UniDescr.uID, UniDescr.description
FROM
UniDescr, UniSeqs, Targets WHERE Targets.gbId=UniSeqs.gbId AND
UniSeqs.uId=UniDescr.uID ORDER BY UniDescr.description DESC LIMIT 10;
15. SELECT
affyId, AVG(level) AS average FROM Data
GROUP
BY AffyId ORDER BY level DESC LIMIT 10;
Note that AVG()
is an aggregate [like COUNT(), MIN(), MAX(), and SUM()] and ÒASÓ creates a
field with the requested name. Since youÕre combining rows, you have to use
GROUP to show how to do it.
+---------------------------+------------+
| affyId
| average |
+---------------------------+------------+
| 93050_at
| 8176.3333 |
| 101869_s_at
| 10145.3333 |
| 101029_f_at
| 7895.3333 |
| 101071_at
| 7165.6667 |
| AFFX-MURINE_B2_at | 22705.0000 |
| 93514_at | 6720.6667 |
| 100921_at
| 6424.0000 |
| AFFX-GapdhMur/M32599_M_at | 12907.0000 |
| 100614_at
| 5847.0000 |
| 101028_i_at
| 5732.3333 |
+---------------------------+------------+
10 rows in set (3.05 sec)
16. What is the
average expression level of each of the six experiments?
SELECT
exptId, AVG(level) AS average From Data GROUP BY exptId;
17. SELECT
Data.affyId, Data.level, Data.exptId, DataCopy.affyId, DataCopy.level,
DataCopy.exptId
FROM
Data, Data DataCopy
WHERE
Data.level > 10 * DataCopy.level
AND Data.affyId=DataCopy.affyId
AND Data.affyId NOT LIKE
"AFFX%"
LIMIT
10;
(To compare rows
of the same table, we copy the table and compare the table to its copy.)
+------------+-------+----------+------------+-------+----------+
| affyId | level | exptId | affyId | level | exptId |
+------------+-------+----------+------------+-------+----------+
| 31333_at |
16 | hs-cer-1 | 31333_at
| 1 |
hs-liv-1 |
| 31337_at |
20 | hs-cer-1 | 31337_at
| 1 |
hs-hrt-1 |
| 31401_r_at | 13 | hs-cer-1 | 31401_r_at | 1 | hs-hrt-1 |
| 31442_at |
519 | hs-cer-1 | 31442_at
| 20 | hs-liv-1
|
| 31491_s_at | 20 | hs-cer-1 | 31491_s_at | 1 | hs-hrt-1 |
| 31526_f_at | 612 | hs-cer-1 | 31526_f_at | 2 | hs-hrt-1 |
| 31526_f_at | 612 | hs-cer-1 | 31526_f_at | 20 | hs-liv-1 |
| 31536_at | 1914 |
hs-cer-1 | 31536_at | 191 | hs-hrt-1 |
| 31601_s_at | 20 | hs-cer-1 | 31601_s_at | 1 | hs-hrt-1 |
| 31606_at |
56 | hs-cer-1 | 31606_at
| 4 |
hs-hrt-1 |
+------------+-------+----------+------------+-------+----------+
10 rows in set (0.10 sec)
18. List ten
non-control human Affy IDs with higher expression in liver than in heart.
SELECT
Data.affId FROM Data.Data DataCopy
WHERE
Data.level> Datacopy.level
AND
Data.affId=DataCopy.affyId AND Data.exprId=Óhs-liv-1Ó
AND
DataCopy.exptId=Óhs-hrt-1Ó
AND
Data.affyId NOT LIKE ÒAFFX%Ó LIMIT 10;
19. SELECT
Data.affyId, Data.level Heart_level, DataCopy.level Brain_level,
Data.level - DataCopy.level Difference
FROM
Data, Data DataCopy
WHERE
Data.affyId= DataCopy.affyId
AND Data.exptId = Òhs-hrt-1Ó
AND DataCopy.exptId = Òhs-cer-1Ó
ORDER
BY Difference DESC
LIMIT
10;
+------------+-------------+-------------+------------+
| affyId | Heart_level | Brain_level | Difference |
+------------+-------------+-------------+------------+
| 32485_at | 26961 | 20 | 26941 |
| 31737_at | 26965 | 57 | 26908 |
| 41731_g_at | 26860 | 20 | 26840 |
| 39063_at | 26574 | 61 | 26513 |
| 36640_at | 25330 |
236 |
25094 |
| 38660_at | 23964 |
115 |
23849 |
| 39031_at | 24492 | 823 | 23669 |
| 31687_f_at | 24020 |
3035 |
20985 |
| 970_r_at | 23390 |
2890 |
20500 |
| 41730_at | 19780 | 43 | 19737 |
+------------+-------------+-------------+------------+
10 rows in set (1.66 sec)
20. SELECT Data.affyId, Targets.gbId,
LocusLinks.
linkId, LocusDescr.description,
Data.level Heart_level, DataCopy.level
Brain_level,
Data.level - DataCopy.level Difference
FROM
Data, Data DataCopy, LocusLinks, LocusDescr,
Targets
WHERE
Data.affyId= DataCopy.affyId
AND Data.exptId = Ôhs-hrt-1Õ
AND DataCopy.exptId = Ôhs-cer-1Õ
AND Data.affyId = Targets.affyId
AND Targets.gbId = LocusLinks.gbId
AND LocusLinks.linkId = LocusDescr.linkId
ORDER
BY Difference DESC
LIMIT
10;
|
affyId | gbId | linkId | description | Heart_level | Brain_level | Difference |