>>>The sql chain puzzle, to build a map of gene collisions.

The "known" genes in the genome are listed in table knownGene. For each gene (row)
there is a genomic interval or locus with coordinates (chrN, Start position, end
position). For each chromosome (1-22, X, Y) there is a table of entities called
chains. These each have two intervals, target and query. A chain is a repeat.

The plan is to make a list for each known gene. The list is named by its target
gene, and populated by the names and descriptions of all the genes associated via
the chains. A chain is a link, or edge, from one region to another in the genome.

For each knowngene, all the chainSelf tables must be scanned for chains whose
target coordinates indicate they would fall into the genes genomic interval. Each
genes interval will have to be expanded by 1.5, but that can wait. With a gene's
interval in mind, a list of chains that are enclosed by or that overlap it can be
made. Each of these chains has a query locus associated with it. QNAME is the
chromosome, QSTART, QEND define the interval. For each query interval, table
knownGene is searched and every overlapping knowngene returned. These are the
known genes that are linked to or "chain to" the initial target gene for which the
list is named. They constitute the first of two columns in this is gene,s list.
The second column is populated by taking the name in knownGene and scanning table
kgXref (were name is equivilent to kgID) to return all teh descriptions of the
genes whose names we collected for this one target gene. I am not particularly
clear on what the best final output is. Suffice to say I would want to sort the
lists by size and see what genes chain to the greatest number of other genes, and
then make a map of the network created by these associations using something like
Cytoscape to show that the distribution of links follows a power law.

>>>Here are the descriptions of the 3 table types:

SQL> describe knownGene
Name Null? Type
----------------------------------------- -------- ----------------------------
NAME NOT NULL VARCHAR2(255)
CHROM NOT NULL VARCHAR2(255)
STRAND NOT NULL CHAR(1)
TXSTART NOT NULL NUMBER(10)
TXEND NOT NULL NUMBER(10)
CDSSTART NOT NULL NUMBER(10)
CDSEND NOT NULL NUMBER(10)
EXONCOUNT NOT NULL NUMBER(10)
PROTEINID NOT NULL VARCHAR2(40)
ALIGNID NOT NULL VARCHAR2(8)

>knownGene looks like this, but with no headers. Values abbreviated to fit page here.
>Table knownGene has 43232 rows total.
#name chrom strand txStart txEnd cdsStart cdsEnd exonCnt proteinID alignID
BC0634 chr1 - 4224 7502 4558 7173 2 AAH63682 24255
AL1373 chr1 - 4266 1936 7413 19346 2 Q9NSV7 41739
BC0544 chr1 - 4268 5808 4558 5808 1 Q96BN3 32430


SQL> describe chr10_chainSelf (there is one table for each chromosome 1-22, X, Y)
Name Null? Type
----------------------------------------- -------- ----------------------------
BIN NOT NULL NUMBER(5)
SCORE NOT NULL FLOAT(126)
TNAME NOT NULL VARCHAR2(255)
TSIZE NOT NULL NUMBER(10)
TSTART NOT NULL NUMBER(10)
TEND NOT NULL NUMBER(10)
QNAME NOT NULL VARCHAR2(255)
QSIZE NOT NULL NUMBER(10)
QSTRAND NOT NULL CHAR(1)
QSTART NOT NULL NUMBER(10)
QEND NOT NULL NUMBER(10)
ID NOT NULL NUMBER(10)

>chr10_chainSelf looks like this with no headers.
#bin score tName tSize tStart tEnd qName qSize qStrand qStart qEnd id
585 63520 chr10 135037 14000 16578 chr7 158545 - 96339 9634 204969
585 63670 chr10 135037 14000 16578 chr7 158545 + 62286 6228 204161
585 81668 chr10 135037 14000 17239 chr7 158545 + 62827 6283 133728


SQL> describe kgXref
Name Null? Type
----------------------------------------- -------- ----------------------------
KGID NOT NULL VARCHAR2(40)
MRNA VARCHAR2(40)
SPID VARCHAR2(40)
SPDISPLAYID VARCHAR2(40)
GENESYMBOL VARCHAR2(40)
REFSEQ VARCHAR2(40)
PROTACC VARCHAR2(40)
DESCRIPTION VARCHAR2(255)

>kgXref looks like this but with no headers. Values have been truncated.
>Table kgXref has 42026 rows total.
#kgID mRNA spID spDisplayID geneSymbol refseq protAcc description
AB0000 AB0000 P56555 DSR4_HUMAN DSCR4 NM_0057 NP_0058 Down syndrome gene 4
AB0001 AB0001 Q99983 OMD_HUMAN OMD NM_0050 NP_0050 osteomodulin
AB0001 AB0001 Q99984 Q99984 C1orf29 NM_0068 NP_0068 mRNA expr in osteoblast


>>>this is the query that was given to me. It can only check through one of the
chainself tables instead of all 24 (chr1-22, X, Y). It returns the same results
over and over. I dont think it ever stops. I have been sending the results to a
spool file, and clearly the query doesnt do what I intended.

/* query1.sql*/

select a.name, a.exoncount, d.description, c.name
from knowngene a,
chr1_chainSelf b,
knownGene c,
kgXref d

where a.txstart <= b.tStart
and a.txEnd >= b.tEnd
and b.qstart >= c.Txstart
and b.qEnd <= c.txend
and c.name = d.kgId
and a.chrom = 'chr1'
;

>>> I am using Oracle9i and SQLplus