The data consist of the upstream region for each gene in yeast. There are over 6000 genes in yeast, and each upstream region is about 500 base pairs long, i.e., a sequence of characters from {A,C,T,G}.
Representation of the data, as always, is crucial. Unfortunately, the best representation is not known; however, one should preserve as much of the biological knowledge and constraints as possible. In particular, regulatory sites are strings of length 5-12 in which the base at a given position may vary.
The data were obtained by biological experiment. There may be some errors, but probably fewer than 1%. There are no missing values.
Number of Instances: 6000+
Each instance is defined, in raw terms, by a sequence of between about 400 and 1000 base pairs. The belief is that patterns will have length no greater than 12. One may introduce various higher-level features, which would change the number of attributes.
The goal is to identify patterns in the strings that are common to a specific family of genes and uncommon elsewhere. There are various languages for describing the patterns, which permit expressing known constraints. One pattern language is strings of length between 6 and 12 over {a,c,g,t}. Another pattern language, which has been found to be a better representation of reality, is probability matrices of length 6 to 12.
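The probability-matrix pattern language can be illustrated with a minimal Python sketch. It is not taken from any of the papers listed here: the function names, the pseudocount, the uniform background of 0.25 per base, and the example sites are all assumptions made for illustration. Each column of the matrix holds the probability of each base at that position, and a window of sequence is scored by its log-odds against the background.

```python
import math

ALPHABET = "acgt"

def build_matrix(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from equal-length example sites.

    The pseudocount (an assumption) keeps unseen bases from getting probability 0.
    """
    width = len(sites[0])
    matrix = []
    for pos in range(width):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos].lower()] += 1
        total = sum(counts.values())
        matrix.append({base: counts[base] / total for base in ALPHABET})
    return matrix

def log_odds(matrix, window, background=0.25):
    """Score a window against the matrix, relative to a uniform background (assumed)."""
    return sum(
        math.log2(column[base] / background)
        for column, base in zip(matrix, window.lower())
    )

def best_match(matrix, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(matrix)
    return max(
        (log_odds(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Made-up example sites of length 6; real sites would come from the literature.
example_sites = ["tgactc", "tgactc", "tgacta", "tgagtc"]
example_matrix = build_matrix(example_sites)
score, start = best_match(example_matrix, "aaaatgactcaaaa")
```

Unlike a fixed string, the matrix tolerates variation at individual positions: a window that differs from the most common base at one position is penalized rather than rejected outright.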
van Helden, Andre, Collado-Vides. Extracting Regulatory Sites from the Upstream Region of Yeast Genes by Computational Analysis of Oligonucleotide Frequencies. Journal of Molecular Biology 1998, 281, 827-842
The paper adopts the language of strings of length 6 over {a,c,g,t} and shows that those strings that occur more frequently than expected match known regulatory elements. In particular, the authors examine 10 families of genes and find strings that partially match the known regulatory sites in most cases. Real regulatory regions are known to sometimes be longer than 6 and are also known to have variability in their base pair constituency.
Identification of Consensus Patterns in Unaligned DNA Sequences Known to be Functionally Related. CABIOS 1990. Hertz, Hartzell, Stormo
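The idea of strings occurring more frequently than expected can be sketched in Python as follows. This is a simplified stand-in and not the statistic used in the paper: it compares the observed count of each 6-mer in a gene family with the count expected from background frequencies and reports the ratio. The function names, the add-one smoothing, the `min_ratio` threshold, and the toy sequences are assumptions for illustration only.

```python
from collections import Counter

def count_kmers(sequences, k=6):
    """Count every overlapping k-mer over {a,c,g,t} in a list of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.lower()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("acgt"):  # skip windows with other characters
                counts[kmer] += 1
    return counts

def over_represented(family, background, k=6, min_ratio=2.0):
    """List k-mers whose family count exceeds the background expectation.

    Returns (kmer, observed, expected, ratio) tuples sorted by ratio, highest first.
    """
    fam = count_kmers(family, k)
    bg = count_kmers(background, k)
    fam_total = sum(fam.values())
    bg_total = sum(bg.values())
    results = []
    for kmer, observed in fam.items():
        # Add-one smoothing (an assumption) avoids division by zero
        # for k-mers never seen in the background.
        bg_freq = (bg[kmer] + 1) / (bg_total + 4 ** k)
        expected = bg_freq * fam_total
        ratio = observed / expected
        if ratio >= min_ratio:
            results.append((kmer, observed, expected, ratio))
    return sorted(results, key=lambda r: r[3], reverse=True)

# Toy sequences, far shorter than real upstream regions.
family = ["aaaatgactcaaaa", "cctgactcgg"]
background = ["acgtacgtacgtacgtacgt", "ttttttttttgggggggggg"]
hits = over_represented(family, background)
```

With real data, the background would be the upstream regions of all genes and the family a set of co-regulated genes; a significance test would normally replace the raw ratio.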
Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization ML95 (21) Bailey and Elkan
Identification of Consensus Patterns in Unaligned DNA and Protein Sequences: A Large-Deviation Statistical Basis for Penalizing Gaps. Hertz, Stormo 1995 3rd International Conference on Bioinformatics and Genome Research.
Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Lawrence, Altschul, Boguski, Liu, Neuwald, Wootton. Science 1993
Extracting Regulatory Sites from the Upstream Region of Yeast Genes by Computational Analysis of Oligonucleotide Frequencies. JMB 1998. J. van Helden, B. Andre and J. Collado-Vides
These sites contain additional data, tutorials, and pointers to the literature: http://genome-www.stanford.edu/Saccharomyces/ http://copan.cifn.unam.mx/Computational_Biology/yeast-tools/