Data giving characteristics of each ORF (potential gene) in the E. coli genome. Sequence, homology (similarity to other genes) and structural information, and function (if known) are provided.
Ross D. King Department of Computer Science, University of Wales Aberystwyth, SY23 3DB, Wales rdk@aber.ac.ukDate Donated: July 14, 2001
The data was collected from several sources, including GenProtEC and SWISSPROT. Structure prediction was made by PROF. Homology search was provided by PSI-BLAST.
The data is in Datalog format. Missing values are not explicit, but some genes have more relationships than others.
E. coli genes (ORFs) are related to each other by the predicate ecoli_to_ecoli(EcoliNumber,E-value,Psi-blast_iteration). They are related to other (SWISSPROT) proteins by the predicate e_val(AccNo,E-value). All the data for a single gene (ORF) is enclosed between delimiters of the form:
begin(model(EcoliNumber)). end(model(EcoliNumber)).
Lists classes and ORF functions. Lines are of the following form:
class(5,1,1,'Colicin-related functions'). class(5,1,'Laterally acquirred elements'). class(5,'Extrachromosomal').
Arguments are up to 3 numbers (describing class at up to 3 different levels), followed by a string class description. For example:
function(ecoli210,7,0,0,'b0217','putative aminopeptidase').
Arguments are ORF number, exactly 3 class numbers, gene name (or blattner number if no gene name), ORF description.
Data for each ORF (gene) is delimited by
begin(model(ecoliX)). end(model(ecoliX)).where X is the ORF number. Other predicates are as follows (examples):
ecoli_orf(ecoliX). % X is ORF number
ecoli_mol_wt(176624.1). % float
ecoli_theo_pI(5.81). %float
ecoli_atomic_comp(c,7940). % {c,h,n,o,s} , int
ecoli_aliphatic_index(69.57). % float
ecoli_hydro(-0.549). % float
sec_struc(1,c,2). % int (start), {a,b,c}, int (length)
sec_struc_coil(1,2). % int (start), int (length)
sec_struc_beta(1,5). % int (start), int (length)
sec_struc_alpha(1,7). % int (start), int (length)
sequence_length(255). % int
amino_acid_ratio(a,8.9). % amino_acid_char, float
amino_acids(ecoli3013,a,70). % ORF_num, amino_acid_char, int
amino_acid_pair_ratio(a,a,9.0). % amino_acid_char, amino_acid_char, float
amino_acid_pairs(a,a,7). % amino_acid_char, amino_acid_char, int
ecoli_to_ecoli(1170,1.0e-105,5). % ORF_num, double (e-value), int (iteration)
e_val(o42893,2.0e-99). % accession_number, double (e-value)
psi_iter(o42893,5). % accession_number, int (iteration)
species(p52494,'candida_albicans__yeast_'). % accession_number, string
mol_wt(p52494,104022). % accession_number, int
classification(p52494,candida). % accession_number, name
keyword(p25195,'plasmid'). % accession_number, string
King, R. D., Karwath, A., Clare, A. and Dehaspe, L. (2000) Accurate prediction of protein functional class in the M. tuberculosis and E. coli genomes using data mining. Journal of Comparative and Functional Genomics, 17, p283-293.
Copyright 2000 by R. D. King, A. Karwath, A. Clare, L. Dehaspe
There are no restrictions data usage. This data is provided "as is" and without any express or implied warranties, including, without limitation, the implied warranties of merchantibility and fitness for a particular purpose.
Please cite King et al. (2000).
This work was supported by the following grants: G78/6609, BIF08765, GR/L62849 and by PharmaDM, Ambachtenlaan, 54/D, B-3001 Leuven, Belgium
King, R. and Karwath, A. and Clare, A. and Dehaspe, L. (2001). The Utility of Different Representations of Protein Sequence for Predicting Functional Class, Bioinformatics, 17(5), pages 445--454.