Guidelines for Documenting Data Sets:

DATA SET INFORMATION

The purpose of this page is to provide detailed information on a
particular data set to enable other researchers to use the data for a
variety of analysis tasks. For example, a data page might describe
census data which could then be used for different analysis tasks such
as classification or clustering.

When filling out this form, simply place your answer after the point
indicated by '>'. We will then process the form to ensure that all
documentation files follow a common format.

1. Title of Data Set
   -- Indicate the central topic of the domain.

> Volcanoes on Venus - JARtool experiments

2. Data Type
   -- Indicate the type of data: multivariate, relational, sequential,
      images, spatial, text, time series, transactional, web data
   -- If the data is heterogeneous, list all relevant types.

> images

3. Abstract
   -- A short one or two sentence description of the data
      (for use in summary pages).

> The JARtool project was a pioneering effort to develop an automatic
> system for cataloging small volcanoes in the large set of Venus
> images returned by the Magellan spacecraft. This package contains
> a variety of data to enable researchers to evaluate algorithms over
> the same images as used for the JARtool experiments reported in [Burl98].

4. Sources
   (a) Original owners of database (name/snail address/phone/email/homepage)

> Michael C. Burl
> MS 126-347, JPL
> 4800 Oak Grove Drive
> Pasadena, CA 91109
> (818) 393-5345
> Michael.C.Burl@jpl.nasa.gov
> http://www-aig.jpl.nasa.gov/mls/home/burl/

   (b) Donor of database (name/snail address/phone/email/homepage)

> same

5. Data Characteristics
   -- Describe how and why the data was collected.
   -- Provide the date and location of the data collection.
      -- distinguish between single and multiple times/locations
   -- Describe the nature of the measurements. For example, are the data
      physical measurements, derived variables, from an opinion poll, etc.
   -- Are there any known systematic biases in the measurements?
      (e.g., camera characteristics for image data)
   -- Are there any known characteristics of random variation?
      (e.g., is there a model for instrument noise?)
   -- Did the data arise from:
      (i) observational data (i.e. data collected routinely)
      (ii) designed experiment
      (iii) designed sampling strategy
      (iv) census (i.e. the data collected on all individuals/objects)
      (v) other (please describe)
   -- Was the data preprocessed? If so, describe the preprocessing.
      -- e.g. was a dimensionality reduction technique like PCA applied?
         was feature selection applied? were missing values imputed?
         were the temporal or spatial aspects removed?
   -- Include any other information that you believe is important.

> The data was collected by the Magellan spacecraft over an
> approximately four-year period from 1990 to 1994. The objective of the
> mission was to obtain global mapping of the surface of
> Venus using synthetic aperture radar (SAR). A more detailed discussion
> of the mission and objectives is available at http://www.jpl.nasa.gov/magellan/.

   -- Depending on the data type, also describe the following:

   All Types:
   (a) Missing Values:
       -- where possible describe why the values are missing
       -- missing at random, missing not at random, not applicable,
          unknown, don't cares

> Some images contain blank (black) regions which resulted from gaps
> in the Magellan acquisition or communication processes. These regions
> can generally be ignored.

   (b) Censored data: indicate if any of the data has been censored.

> No.

   (c) Cost Information: (if applicable/available)
       -- e.g. cost to measure a variable (feature,attribute,field)
>
   (d) Dependencies: are there any known dependencies between cases.
       For example, do instances represent the same object viewed from
       different angles.

> There are some spatial dependencies.
> For example, background patches from within a single image are likely
> to be more similar than background patches taken across different images.

   Multivariate:
   (a) for each variable (feature/attribute/field):
       (i) give a name
       (ii) provide typing information (i.e. is the variable categorical
            (ordered, unordered) or continuous).
            -- Be careful to distinguish categorical values that have
               been encoded numerically.
       (iii) if the variable is categorical, list and describe the
             possible values it may take on.

> The images are 1024 X 1024 pixels. The pixel values are in the range [0,255].
> The pixel value is related to the amount of energy backscattered to the
> radar from a given spatial location. Higher pixel values indicate greater
> backscatter. Lower pixel values indicate lesser backscatter. Both topography
> and surface roughness relative to the radar wavelength affect the amount
> of backscatter.

   Time Series/Sequences:
   (a) multivariate vs. univariate
>
   (b) annotated: are the data annotated and if so with what information?
>
   (c) anomalies: saturation, drop outs, missing values...
>
   (d) other: e.g. sampling rates, uniform or non-uniform sampling, etc.
>

   Text:
   (a) language: (e.g. english, french, spanish,...)
>
   (b) annotated: are the data annotated and if so with what information?
>
   (c) structured: is the text structured in any manner
       (e.g. is the text html, or email)
>
   (d) preprocessing: case, punctuation, stoplisted
>

   Image or Spatial:
   (a) information depth (e.g. 8 bits per pixel/point)

> 8 bits per pixel

   (b) annotated: are the data annotated and if so with what information?

> In addition to the images, there are "ground truth" files that
> specify the locations of volcanoes within the images. The quotes
> around "ground truth" are intended as a reminder that there is no
> absolute ground truth for this data set. No one has been to Venus
> and the image quality does not permit 100% unambiguous
> identification of the volcanoes, even by human experts.
> There are labels that provide some measure of subjective uncertainty
> (1 = definitely a volcano, 2 = probably, 3 = possibly,
> 4 = only a pit is visible).
> See reference [Smyth95] for more information on the labeling
> uncertainty problem.
>
> There are also files that specify the exact set of experiments used in the
> published evaluations of the JARtool system.
>

   (c) resolution or grid size

> Each image is 1024 X 1024. The pixel spacing is 75 meters in both dimensions.

   (d) for 3D data (e.g. MRIs) resolutions in all dimensions

> N/A

6. Other Relevant Information
   -- Include any additional information about the data that the
      researcher may find useful. (Note there is a separate document for
      including task specific information.) For example:
      Prior Knowledge
      -- Describe the types of prior knowledge that could be used
         (provide references if possible). For example:
         (i) transformations of the data
         (ii) class hierarchies
         (iii) known relevant and irrelevant variables
         (iv) constraints: e.g. variables could be positive or
              restricted to a certain range.
>

7. Data Format
   -- Describe how the data is stored in the archive. List the files,
      their contents, and their format (if not in plain text).
   -- Please describe thoroughly any data formats used which are
      unlikely to be known by non specialists. If necessary, use a
      separate file to describe the format.
   -- If there are different versions of the data, please describe the
      differences. For example, are there specific training and test
      sets that are commonly used in the literature.

> The image files are in a format called VIEW. This format consists of
> two files, a binary file with extension .sdt (the image data) and an ascii
> file with extension .spr (header information). There is a MATLAB utility
> function included in the data package that can be used to read the data.
> If you want to use something other than MATLAB, you are on your own, but
> the format is fairly simple and can be understood by looking at the
> MATLAB code.
>
> The labeling files are provided in two forms. The .lxyr files are simple
> space-separated ascii containing label, x-location of center, y-location
> of center, and radius.

8. Past Usage
   -- Include references that discuss the data itself and how it was
      collected. These references could be either classic and widely
      cited papers that were the first to announce the data and its
      characteristics, or possibly a domain specific web site.

> o http://www.jpl.nasa.gov/magellan/
>
> o G.H. Pettengill, P.G. Ford, W.T.K. Johnson, R.K. Raney, L.A. Soderblom,
>   "Magellan: Radar Performance and Data Products", Science, 252:260-265 (1991).
>
> o R.S. Saunders, A.J. Spear, P.C. Allin, R.S. Austin, A.L. Berman, R.C. Chandlee,
>   J. Clark, A.V. Decharon, E.M. Dejong, "Magellan Mission Summary", J. of
>   Geophysical Research Planets, 97(E8):13067-13090 (1992).
>
> o [Burl98] M.C. Burl, L. Asker, P. Smyth, U. Fayyad, P. Perona,
>   L. Crumpler, and J. Aubele, "Learning to Recognize
>   Volcanoes on Venus", Machine Learning (March 1998).
>
> o [Smyth95] P. Smyth, M.C. Burl, U.M. Fayyad, and P. Perona,
>   Chapter: "Knowledge Discovery in Large Image Databases:
>   Dealing with Uncertainties in Ground Truth", In Advances
>   in Knowledge Discovery and Data Mining, AAAI/MIT Press,
>   Menlo Park, CA (1995).
>
> o http://www-aig.jpl.nasa.gov/mls/mgn-sar/
>

9. Acknowledgements, Copyright Information, and Availability
   (a) copyright information
>
   (b) usage restrictions (e.g. for research only)

> Anyone seeking to publish results on this data should perform at a
> minimum the *FULL SUITE* of experiments defined in the
> Experiments_Images_Table included in the dataset and compare
> performance to that of the baseline JARtool system [Burl98]. Results
> on a small subset of images (e.g., HOM4) are not of interest and
> *SHOULD NOT BE PUBLISHED*. Send e-mail to jartool@aig.jpl.nasa.gov
> if you believe you have a compelling reason to deviate from this
> policy.
> We would also appreciate it if you would report your results
> (whether POSITIVE OR NEGATIVE) along with a brief description of
> your algorithm to jartool@aig.jpl.nasa.gov.

   (c) citation requests
>
   (d) acknowledgements

> Assembly of this dataset has been carried out in part by the Jet Propulsion
> Laboratory, California Institute of Technology, under contract with the
> National Aeronautics and Space Administration.

10. References & Further Information
   -- Include here references to additional information that describes
      the data itself. (Note there is another document for references
      that describe analyses of the data).
      (a) pointers to tutorial/background information on the domain
      (b) other useful web sites (parent archives, domain specific sites)
      (c) where to get more data (if this is a subsample)
      (e) other relevant publications
      (f) any additional comments on this dataset

> Refer to the README file included with the dataset for more complete
> information.
>
> The complete Magellan imageset is available in a large collection of
> about 150 CD-ROMs.
> Consult the NASA CD-ROM catalog at http://nssdc.gsfc.nasa.gov/cd-rom/cd-rom.html
> for additional information.
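
For readers not using MATLAB, the .lxyr label files described under
Data Format (space-separated ascii: label, x-location of center,
y-location of center, radius) can probably be read with a few lines of
Python. The following is only a minimal sketch under stated
assumptions: the names Volcano, parse_lxyr and diameter_m are
hypothetical and not part of the data package, the sample values are
made up for illustration, and the radius is assumed to be expressed in
pixels so that the 75 meter pixel spacing can be used to convert it.
Check the README shipped with the dataset before relying on any of this.

```python
from collections import Counter
from typing import List, NamedTuple

# Pixel spacing stated in this document (75 meters in both dimensions).
PIXEL_SPACING_M = 75.0

# Subjective uncertainty labels as listed in this document.
LABEL_MEANING = {
    1: "definitely a volcano",
    2: "probably",
    3: "possibly",
    4: "only a pit is visible",
}


class Volcano(NamedTuple):
    """One labeled location (hypothetical container, not from the package)."""
    label: int
    x: float
    y: float
    radius: float


def parse_lxyr(text: str) -> List[Volcano]:
    """Parse space-separated 'label x y radius' records, one per line."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        label, x, y, radius = fields
        # int(float(...)) accepts both "1" and "1.0" style labels.
        records.append(Volcano(int(float(label)), float(x), float(y), float(radius)))
    return records


def diameter_m(v: Volcano) -> float:
    """Approximate diameter in meters, ASSUMING the radius is in pixels."""
    return 2.0 * v.radius * PIXEL_SPACING_M


# Made-up sample content, only to exercise the parser.
sample = """1 155 308 18.5
3 512.0 640.0 7.25
4 90 1001 4
"""

volcanoes = parse_lxyr(sample)
label_counts = Counter(v.label for v in volcanoes)
for label, count in sorted(label_counts.items()):
    print(label, LABEL_MEANING[label], count)
```

Keeping the label as an integer key makes it easy to filter by
confidence level, for example keeping only labels 1 and 2 when a
stricter notion of "ground truth" is wanted.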