Guidelines for Documenting Databases: Dataset Information

The purpose of this page is to provide detailed information on a particular
data set to enable other researchers to use the data for a variety of analysis 
tasks. For example, a data page might describe census data which could then be
used for different analysis tasks such as classification or clustering.

When filling out this form, simply place your answer after the point indicated
by '>'. We will then process the form to ensure that all documentation files 
follow a common format.


1. Title of Database
   -- Indicate the central topic of the domain.
> 
Anonymous web data from www.microsoft.com

2. Data Type
   -- Indicate the type of data: image, multivariate, relational, sequence, 
      spatial, text, time series, transaction
   -- If the data is heterogeneous, list all relevant types.
>
relational, multivariate

3. Abstract
   -- A short one or two sentence description of the data (for use in summary 
      pages).
>
This dataset records which areas (Vroots) of www.microsoft.com each user
visited in a one-week timeframe in February 1998.

4. Sources
   (a) Original owners of database (name/snail address/phone/email/homepage)
>
Jack S. Breese, David Heckerman, Carl M. Kadie
Microsoft Research, Redmond WA, 98052-6399, USA
breese@microsoft.com, heckerma@microsoft.com, carlk@microsoft.com

   (b) Donor of database (name/snail address/phone/email/homepage)
>

5. Data Characteristics
   -- Describe how and why the data was collected.
   -- Provide the date and location of the data collection.
        -- distinguish between single and multiple times/locations 
   -- Describe the nature of the measurements. For example, are the
      data physical measurements, derived variables, from an opinion
      poll, etc.
   -- Are there any known systematic biases in the measurements? (e.g.,
      camera characteristics for image data)
   -- Are there any known characteristics of random variation? (e.g., is
      there a model for instrument noise?)
   -- Did the data arise from:
        (i) observational data (i.e. data collected routinely)
       (ii) designed experiment
      (iii) designed sampling strategy
       (iv) census, (i.e. the data collected on all individuals/objects)
        (v) other, (please describe)
   -- Was the data preprocessed? if so, describe the preprocessing.
        --  e.g. was a dimensionality reduction technique like PCA applied? 
                 was feature selection applied?
                 were missing values imputed?
                 were the temporal or spatial aspects removed?
   -- Include any other information that you believe is important.
>
The data was created by sampling and processing the www.microsoft.com logs.
The data records the use of www.microsoft.com by about 38000 anonymous,
randomly selected users (37711 across the training and testing sets listed
below). For each user, the data lists all the areas of the web site (Vroots)
that the user visited in a one-week timeframe.

Users are identified only by a sequential number, for example, User #14988,
User #14989, etc. The file contains no personally identifiable information.
The 294 Vroots are identified by their title (e.g. "NetShow for PowerPoint")
and URL (e.g. "/stream"). The data comes from one week in February, 1998.

Number of Instances:
 -- Training: 32711
 -- Testing:   5000
Each instance represents an anonymous, randomly selected user of the web
site.

Number of Attributes: 294
Each attribute is an area ("Vroot") of the www.microsoft.com web site.

Missing Attribute Values: The data is very sparse, so only Vroot visits are
recorded explicitly; non-visits are implicit (missing).
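
Because non-visits are implicit, a dense 0/1 row has to be rebuilt by the
reader. The following is a minimal sketch, assuming the full list of
attribute IDs is already known. The function name to_dense_row and the
four-ID list are illustrative only and are not part of the dataset.

```python
def to_dense_row(visited_ids, all_ids):
    """Expand one user's visited attribute IDs into a dense 0/1 list.

    Any attribute ID not listed for the user is treated as a non-visit (0).
    """
    visited = set(visited_ids)
    return [1 if attr_id in visited else 0 for attr_id in all_ids]


# Illustrative subset of attribute IDs (the real data has 294 of them).
all_ids = [1009, 1052, 1123, 1277]

# A user who visited Vroots 1123, 1009 and 1052, but not 1277.
row = to_dense_row([1123, 1009, 1052], all_ids)
print(row)
```

With a mean of about 3 visits per user out of 294 attributes, a dense
matrix would be almost entirely zeros, so a sparse representation is
usually the more practical choice.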


   -- Depending on the data type, also describe the following:

   All Types:
     (a) Missing Values:
         -- where possible describe why the values are missing
         -- missing at random, missing not at random, not applicable,
            unknown, don't cares 
>
     (b) Censored data: indicate if any of the data has been censored.
>
     (c) Cost Information: (if applicable/available)
         -- e.g. cost to measure a variable (feature,attribute)
>
     (d) Dependencies: are there any known dependencies between cases. For 
         example, do instances represent the same object viewed from different 
         angles.
>

   Multivariate:
     (a) for each variable (feature/attribute):
           (i) give a name 
          (ii) provide typing information (i.e. is the variable categorical
               (ordered, unordered) or continuous).
               -- Be careful to distinguish categorical values that have been
                  encoded numerically.
         (iii) if the variable is categorical, list and describe the possible 
               values it may take on.
>

   Time Series/Sequences:
     (a) multivariate vs. univariate
>
     (b) annotated: are the data annotated and if so with what information?
>
     (c) anomalies: saturation, drop outs, missing values...
>

   Text:
     (a) language: (e.g. english, french, spanish,...)
>
     (b) annotated: are the data annotated and if so with what information?
>
     (c) structured: is the text structured in any manner (e.g. is the
         text html, or email) 
>
     (d) preprocessing: case, punctuation, stoplisted
>

   Image or Spatial:
     (a) information depth (e.g. 8 bits per pixel/point)
>
     (b) annotated: are the data annotated and if so with what information?
>
     (c) resolution or grid size
>
     (d) for 3D data (e.g. MRIs) resolutions in all dimensions
>


6. Other Relevant Information
   -- Include any additional information about the data that the researcher 
      may find useful. (Note there is a separate document for including
      task specific information.) For example:

      (a) Prior Knowledge
          -- Describe the types of prior knowledge that could be used, 
             (provide references if possible). For example:  
             (i) transformations of the data
             (ii) class hierarchies
             (iii) known relevant and irrelevant variables
             (iv) constraints: e.g. variables could be positive or restricted
                  to a certain range.
>
Mean number of vroot visits per case: 3.0

7. Data Format
   -- Describe how the data is stored in the archive. List the files, their
      contents, and their format (if not in plain text).
   -- Please describe thoroughly any data formats used which are unlikely
      to be known by non specialists. If necessary, use a separate file
      to describe the format.
   -- If there are different versions of the data, please describe the
      differences. 
>
The data is in an ASCII-based sparse-data format called "DST".
Each line of the data file starts with a letter which tells the line's type.
The three line types of interest are attribute, case, and vote lines:
  -- Attribute lines:
       For example, 'A,1277,1,"NetShow for PowerPoint","/stream"'
       Where:
         'A' marks this as an attribute line,
         '1277' is the attribute ID number for an area of the website
           (called a Vroot),
         '1' may be ignored,
         '"NetShow for PowerPoint"' is the title of the Vroot,
         '"/stream"' is the URL relative to "http://www.microsoft.com".

  -- Case and vote lines:
       For each user, there is a case line followed by zero or more vote
       lines. For example:
         C,"10164",10164
         V,1123,1
         V,1009,1
         V,1052,1
       Where:
         'C' marks this as a case line,
         '10164' is the case ID number of a user,
         'V' marks the vote lines for this case,
         '1123', '1009', '1052' are the attribute IDs of Vroots that the
           user visited,
         '1' may be ignored.
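
The line types described above can be read with a few lines of code. The
following is a minimal sketch, not an official reader: the function name
parse_dst is hypothetical, the sample text is built from the example lines
shown above, and any line type other than A, C, or V is simply skipped.
Python's csv module is used so that quoted titles are handled correctly.

```python
import csv
from io import StringIO


def parse_dst(lines):
    """Parse DST-format lines into (attributes, cases).

    attributes maps attribute ID -> (title, url).
    cases maps case ID -> list of visited attribute IDs.
    """
    attributes = {}
    cases = {}
    current_case = None
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        kind = row[0]
        if kind == "A":
            # A,<attribute ID>,<ignored>,<title>,<url>
            attributes[int(row[1])] = (row[3], row[4])
        elif kind == "C":
            # C,<case ID>,<case ID>
            current_case = int(row[1])
            cases[current_case] = []
        elif kind == "V" and current_case is not None:
            # V,<attribute ID>,<ignored>
            cases[current_case].append(int(row[1]))
        # Other line types are ignored in this sketch.
    return attributes, cases


sample = '''A,1277,1,"NetShow for PowerPoint","/stream"
C,"10164",10164
V,1123,1
V,1009,1
V,1052,1
'''

attributes, cases = parse_dst(StringIO(sample))
print(attributes)
print(cases)
```

To read a real file, an open file object can be passed in place of the
StringIO wrapper, since csv.reader accepts any iterable of lines.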


8. Past Usage
    -- Include references that discuss the data itself and how it was 
       collected. These references could be either classic and widely
       cited papers that were the first to announce the data and its 
       characteristics, or possibly a domain specific web site.
>
J. Breese, D. Heckerman, C. Kadie. _Empirical Analysis of
Predictive Algorithms for Collaborative Filtering_. Proceedings
of the Fourteenth Conference on Uncertainty in Artificial Intelligence,
Madison, WI, July 1998.

   
9. Acknowledgements, Copyright Information, and Availability 
  (a) copyright information
>
  (b) usage restrictions (e.g. for research only)
>
  (c) citation requests
>
  (d) acknowledgements
>

10. References & Further Information
  -- Include here references to additional information that describes the
     data itself. (Note there is another document for references that describe
     analyses of the data).
  (a) pointers to tutorial/background information on the domain
  (b) other useful web sites (parent archives, domain specific sites)
  (c) where to get more data (if this is a subsample)
  (d) other relevant publications
  (e) any additional comments on this dataset
>