Relational Databases for Biologists

Session 2: Mining a database with SQL

Solutions to Exercise 2: Mining a database with SQL

Goals: Learn how to query a MySQL database to extract selected data relationships. Learn how to convert a question (in English) into a SQL query and vice versa.

1.

SELECT * from Sources;
Get every attribute from each tuple of the Sources table. In other words, print all fields of every line in the Sources table.

2. What are some example GenBank IDs (accessions) and corresponding sequence descriptions?

SELECT * FROM Descriptions LIMIT 5;

3.

SELECT count(*) FROM LocusLinks;
How many rows are in table LocusLinks? 70153

4. How many different Affy IDs are in the expression data?

SELECT COUNT(DISTINCT affyId) FROM Data;
How is this different from "SELECT COUNT(affyId) FROM Data"?

5.

SELECT description, species FROM LocusDescr
WHERE linkId = 655;
What is the description and species of the LocusLink gene with ID = 655?

6. What's the expression level of Affy ID 33659_at in experiment hs-liv-1?

SELECT level FROM Data WHERE affyId="33659_at"
AND exptId="hs-liv-1";

7.

SELECT * FROM LocusDescr WHERE description LIKE "phosphatase%";
What LocusLink descriptions start with the word "phosphatase"?

Extra credit: Is this any different from
SELECT * FROM LocusDescr WHERE description LIKE "%phosphatase%" ?

8. What are all the gene descriptions, along with their GenBank IDs, containing the phrase "growth factor"?

SELECT * FROM Descriptions
WHERE description LIKE "%growth factor%";

9.

SELECT * from Data WHERE exptId = "mm-hrt-1" and level > 6000
ORDER BY level DESC;
Select all rows from Data for the "mm-hrt-1" hydridization (mouse heart) with an expression level above 6000, and order the results by decreasing expression level.

10. What Gene Ontology terms (and corresponding accessions) contain the phrase "transcription factor"? Put your answer in alphabetical order of terms.

SELECT * FROM GO_Descr
WHERE description LIKE "%transcription factor%"
ORDER BY description;

11.

SELECT Data.affyId, Targets.gbId, Data.level
FROM Data, Targets
WHERE Data.affyId = Targets.affyId
ORDER BY Data.level DESC
LIMIT 10;
Join fields from the Data and Targets tables (linking rows with the same Affy IDs), and order these by decreasing expression level.

12. What Affy probes correspond to target sequences with the phrase "interleukin" in their description?

SELECT affyId FROM Targets, Descriptions
WHERE Targets.gbId=Descriptions.gbId
AND Descriptions.description LIKE "%interleukin%";

13.

SELECT Data.affyId, Targets.gbId, UniSeqs.uId, Data.level
FROM Data, Targets, UniSeqs
WHERE Data.affyId = Targets.affyId
AND Targets.gbId = UniSeqs.gbId
ORDER BY level DESC
LIMIT 10;
Join fields from the Data, Targets, and UniSeqs tables, and order these by decreasing expression level.

14. Make a table of ten affyId, UnigeneID, and Unigene descriptions in reverse alphabetical order of Unigene descriptions.

SELECT Targets.affyid, UniDescr.uID, UniDescr.description
FROM UniDescr, UniSeqs, Targets
WHERE Targets.gbId=UniSeqs.gbId
AND UniSeqs.uId=UniDescr.uID
ORDER BY UniDescr.description DESC
LIMIT 10;

15.

SELECT affyId, AVG(level) AS average FROM Data
GROUP BY affyId
ORDER BY level DESC
LIMIT 10;
Get the average expression level of each array probe (affyId) across all tissues (experiments).

16. What is the average expression level of each of the six experiments?

SELECT exptId, AVG(level) AS average
From Data
GROUP BY exptId;

17.

SELECT Data.affyId, Data.level, Data.exptId, DataCopy.affyId, DataCopy.level, DataCopy.exptId
FROM Data, Data DataCopy
WHERE Data.level > 10 * DataCopy.level
AND Data.affyId=DataCopy.affyId
AND Data.affyId NOT LIKE "AFFX%"
LIMIT 10;
Find the microarray probes for which one tissue/experiment has an expression level that's at least 10 times higher than another tissue/experiment, and print ten of these.

18. List ten non-control human Affy IDs (IDs without "AFFX") with higher expression in liver than in heart. SELECT Data.affyId FROM Data.Data DataCopy WHERE Data.level> Datacopy.level AND Data.affId=DataCopy.affyId AND Data.exprId="hs-liv-1" AND DataCopy.exptId="hs-hrt-1" AND Data.affyId NOT LIKE "AFFX%" LIMIT 10;

19.

SELECT Data.affyId,

Data.level AS Heart_level, DataCopy.level AS Brain_level,
Data.level - DataCopy.level AS Difference
FROM Data, Data DataCopy
WHERE Data.affyId= DataCopy.affyId
AND Data.exptId = "hs-hrt-1"
AND DataCopy.exptId = "hs-cer-1"
ORDER BY Difference DESC
LIMIT 10;
List all probe sets (printing only the first ten) in order of the difference between expression level in the heart and the brain.

20.

SELECT Data.affyId, Targets.gbId, LocusLinks.
linkId, LocusDescr.description,
Data.level as Heart_level,
DataCopy.level as Brain_level,
Data.level - DataCopy.level AS Difference
FROM Data, Data DataCopy, LocusLinks, LocusDescr, Targets
WHERE Data.affyId= DataCopy.affyId
AND Data.exptId = "hs-hrt-1"
AND DataCopy.exptId = "hs-cer-1"
AND Data.affyId = Targets.affyId
AND Targets.gbId = LocusLinks.gbId
AND LocusLinks.linkId = LocusDescr.linkId
ORDER BY Difference DESC
LIMIT 10;
List all probe sets (printing only the first ten) in order of the difference between expression level in the heart and the brain, including annotation about the probe sets.


Questions or comments?   gbell@wi.mit.edu
Bioinformatics and Research Computing at Whitehead Institute