Guidelines for Documenting Databases: Dataset Information

The purpose of this page is to provide detailed information on a particular data set to enable other researchers to use the data for a variety of analysis tasks. For example, a data page might describe census data which could then be used for different analysis tasks such as classification or clustering.

When filling out this form, simply place your answer after the point indicated by '>'. We will then process the form to ensure that all documentation files follow a common format.

1. Title of Database
   -- Indicate the central topic of the domain.
> Anonymous web data from www.microsoft.com

2. Data Type
   -- Indicate the type of data: image, multivariate, relational, sequence, spatial, text, time series, transaction
   -- If the data is heterogeneous, list all relevant types.
> relational, multivariate

3. Abstract
   -- A short one or two sentence description of the data (for use in summary pages).
> This dataset records which areas (Vroots) of www.microsoft.com each user visited in a one-week timeframe in February 1998.

4. Sources
   (a) Original owners of database (name/snail address/phone/email/homepage)
> Jack S. Breese, David Heckerman, Carl M. Kadie
  Microsoft Research, Redmond WA, 98052-6399, USA
  breese@microsoft.com, heckerma@microsoft.com, carlk@microsoft.com
   (b) Donor of database (name/snail address/phone/email/homepage)
>

5. Data Characteristics
   -- Describe how and why the data was collected.
   -- Provide the date and location of the data collection.
      -- distinguish between single and multiple times/locations
   -- Describe the nature of the measurements. For example, are the data physical measurements, derived variables, from an opinion poll, etc.
   -- Are there any known systematic biases in the measurements? (e.g., camera characteristics for image data)
   -- Are there any known characteristics of random variation? (e.g., is there a model for instrument noise?)
   -- Did the data arise from:
      (i) observational data (i.e.
          data collected routinely)
      (ii) designed experiment
      (iii) designed sampling strategy
      (iv) census, (i.e. the data collected on all individuals/objects)
      (v) other, (please describe)
   -- Was the data preprocessed? if so, describe the preprocessing.
      -- e.g. was a dimensionality reduction technique like PCA applied? was feature selection applied? were missing values imputed? were the temporal or spatial aspects removed?
   -- Include any other information that you believe is important.
> The data was created by sampling and processing the www.microsoft.com logs. The data records the use of www.microsoft.com by 38000 anonymous, randomly-selected users. For each user, the data lists all the areas of the web site (Vroots) that the user visited in a one week timeframe.

  Users are identified only by a sequential number, for example, User #14988, User #14989, etc. The file contains no personally identifiable information. The 294 Vroots are identified by their title (e.g. "NetShow for PowerPoint") and URL (e.g. "/stream"). The data comes from one week in February, 1998.

  Number of Instances:
  -- Training: 32711
  -- Testing: 5000
  Each instance represents an anonymous, randomly selected user of the web site.

  Number of Attributes: 294
  Each attribute is an area ("vroot") of the www.microsoft.com web site.

  Missing Attribute Values:
  The data is very sparse, so vroot visits are explicit, nonvisits are implicit (missing).

   -- Depending on the data type, also describe the following:

   All Types:
   (a) Missing Values:
       -- where possible describe why the values are missing
       -- missing at random, missing not at random, not applicable, unknown, don't cares
>
   (b) Censored data: indicate if any of the data has been censored.
>
   (c) Cost Information: (if applicable/available)
       -- e.g. cost to measure a variable (feature,attribute)
>
   (d) Dependencies: are there any known dependencies between cases. For example, do instances represent the same object viewed from different angles.
>

   Multivariate:
   (a) for each variable (feature/attribute):
       (i) give a name
       (ii) provide typing information (i.e. is the variable categorical (ordered, unordered) or continuous).
            -- Be careful to distinguish categorical values that have been encoded numerically.
       (iii) if the variable is categorical, list and describe the possible values it may take on.
>

   Time Series/Sequences:
   (a) multivariate vs. univariate
>
   (b) annotated: are the data annotated and if so with what information?
>
   (c) anomalies: saturation, drop outs, missing values...
>

   Text:
   (a) language: (e.g. english, french, spanish,...)
>
   (b) annotated: are the data annotated and if so with what information?
>
   (c) structured: is the text structured in any manner (e.g. is the text html, or email)
>
   (d) preprocessing: case, punctuation, stoplisted
>

   Image or Spatial:
   (a) information depth (e.g. 8 bits per pixel/point)
>
   (b) annotated: are the data annotated and if so with what information?
>
   (c) resolution or grid size
>
   (d) for 3D data (e.g. MRIS) resolutions in all dimensions
>

6. Other Relevant Information
   -- Include any additional information about the data that the researcher may find useful. (Note there is a separate document for including task specific information.)
   For example:
   (a) Prior Knowledge
       -- Describe the types of prior knowledge that could be used, (provide references if possible). For example:
          (i) transformations of the data
          (ii) class hierarchies
          (iii) known relevant and irrelevant variables
          (iv) constraints: e.g. variables could be positive or restricted to a certain range.
> Mean number of vroot visits per case: 3.0

7. Data Format
   -- Describe how the data is stored in the archive. List the files, their contents, and their format (if not in plain text).
   -- Please describe thoroughly any data formats used which are unlikely to be known by non specialists. If necessary, use a separate file to describe the format.
   -- If there are different versions of the data, please describe the differences.
> The data is in an ASCII-based sparse-data format called "DST". Each line of the data file starts with a letter that indicates the line's type. The three line types of interest are:

  -- Attribute Lines:
     For example, 'A,1277,1,"NetShow for PowerPoint","/stream"'
     Where:
       'A' marks this as an attribute line,
       '1277' is the attribute ID number for an area of the website (called a Vroot),
       '1' may be ignored,
       '"NetShow for PowerPoint"' is the title of the Vroot,
       '"/stream"' is the URL relative to "http://www.microsoft.com"

  -- Case and Vote Lines:
     For each user, there is a case line followed by zero or more vote lines.
     For example:
       C,"10164",10164
       V,1123,1
       V,1009,1
       V,1052,1
     Where:
       'C' marks this as a case line,
       '10164' is the case ID number of a user,
       'V' marks the vote lines for this case,
       '1123', '1009', '1052' are the attribute IDs of Vroots that a user visited,
       '1' may be ignored.

8. Past Usage
   -- Include references that discuss the data itself and how it was collected. These references could be either classic and widely cited papers that were the first to announce the data and its characteristics, or possibly a domain specific web site.
> J. Breese, D. Heckerman, C. Kadie, _Empirical Analysis of Predictive Algorithms for Collaborative Filtering_, Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI, July, 1998.

9. Acknowledgements, Copyright Information, and Availability
   (a) copyright information
>
   (b) usage restrictions (e.g. for research only)
>
   (c) citation requests
>
   (d) acknowledgements
>

10. References & Further Information
   -- Include here references to additional information that describes the data itself. (Note there is another document for references that describe analyses of the data).
   (a) pointers to tutorial/background information on the domain
   (b) other useful web sites (parent archives, domain specific sites)
   (c) where to get more data (if this is a subsample)
   (e) other relevant publications
   (f) any additional comments on this dataset
>
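
Illustration for item 7 (Data Format): the DST layout described there is comma-separated with optional double quotes, so it can probably be read with Python's standard csv module. The sketch below is a minimal reader under stated assumptions, not the tool used by the data owners. The names SAMPLE, parse_dst, and mean_visits are hypothetical; the sample lines are the examples quoted in item 7. It assumes that the case ID can be taken from the third field of a case line and that any line type other than A, C, or V can be skipped.

```python
import csv
import io

# Example lines copied from the Data Format description (item 7).
SAMPLE = '''A,1277,1,"NetShow for PowerPoint","/stream"
C,"10164",10164
V,1123,1
V,1009,1
V,1052,1
'''


def parse_dst(lines):
    """Parse DST-format lines into (attributes, cases).

    attributes maps attribute ID -> (title, relative URL).
    cases maps case ID -> list of visited attribute IDs, in file order.
    Non-visits stay implicit, as in the sparse source format.
    """
    attributes = {}
    cases = {}
    current_case = None
    for row in csv.reader(lines):
        if not row:
            continue  # tolerate blank lines
        kind = row[0]
        if kind == "A":
            # A,<attribute ID>,<ignored>,<title>,<relative URL>
            attributes[int(row[1])] = (row[3], row[4])
        elif kind == "C":
            # C,<quoted case ID>,<case ID>; assumption: use the third field
            current_case = int(row[2])
            cases[current_case] = []
        elif kind == "V":
            # V,<attribute ID>,<ignored>; belongs to the preceding case line
            if current_case is None:
                raise ValueError("vote line found before any case line")
            cases[current_case].append(int(row[1]))
        # Assumption: any other line type is not of interest and is skipped.
    return attributes, cases


def mean_visits(cases):
    """Return the mean number of Vroot visits per case."""
    if not cases:
        return 0.0
    return sum(len(votes) for votes in cases.values()) / len(cases)


if __name__ == "__main__":
    attributes, cases = parse_dst(io.StringIO(SAMPLE))
    for case_id, votes in cases.items():
        print(case_id, votes)
    print("mean visits per case:", mean_visits(cases))
```

To read a data file instead of the embedded sample, an open file object can be passed in place of io.StringIO(SAMPLE); the file name and encoding are not given in this form and would have to be supplied by the user. Keeping each case as a list of visited attribute IDs preserves the sparse representation; a dense 294-column row per user is only needed if a particular analysis requires it.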