Anonymous web data from www.microsoft.com

Data Type

relational, multivariate

Abstract

This dataset records which areas (Vroots) of www.microsoft.com each user visited in a one-week timeframe in February 1998.

Sources

Original Owner and Donor

Jack S. Breese, David Heckerman, Carl M. Kadie
Microsoft Research, Redmond WA, 98052-6399, USA
breese@microsoft.com, heckerma@microsoft.com, carlk@microsoft.com
Date Donated: November 30, 1998

Data Characteristics

The data was created by sampling and processing the www.microsoft.com logs. It records the use of www.microsoft.com by about 38,000 anonymous, randomly selected users. For each user, the data lists all the areas of the web site (Vroots) that the user visited in a one-week timeframe.

Users are identified only by a sequential number, for example, User #14988, User #14989, etc. The file contains no personally identifiable information. The 294 Vroots are identified by their title (e.g. "NetShow for PowerPoint") and URL (e.g. "/stream"). The data comes from one week in February, 1998.

Each instance represents an anonymous, randomly selected user of the web site. Each attribute is an area ("vroot") of the www.microsoft.com web site.

Missing Attribute Values: The data is very sparse, so vroot visits are recorded explicitly and non-visits are implicit (missing).
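
To illustrate the implicit non-visits, here is a minimal Python sketch that expands one user's list of visited vroot IDs into a dense 0/1 row. The function name to_dense_row and the four-element attribute_ids list are assumptions made for illustration only; the real data has 294 attributes.

```python
def to_dense_row(visited_ids, attribute_ids):
    """Return a 0/1 list aligned with attribute_ids (1 = visited)."""
    visited = set(visited_ids)
    # Any attribute not listed for the user is an implicit non-visit (0).
    return [1 if attr in visited else 0 for attr in attribute_ids]

# Illustrative subset of attribute IDs, not the full set of 294.
attribute_ids = [1009, 1052, 1123, 1277]

# A user who visited vroots 1123, 1009 and 1052 but not 1277.
row = to_dense_row([1123, 1009, 1052], attribute_ids)
```

With about 3 visits per case across 294 attributes, nearly every entry of such a row is 0, which is why the file stores only the visits.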

Summary Statistics

Training Instances 32711
Testing Instances 5000
Attributes 294
Mean vroot visits per case 3.0

Data Format

The data is in an ASCII-based sparse-data format called "DST". Each line of the data file starts with a letter which tells the line's type. The three line types of interest are:
-- Attribute lines:
For example, 'A,1277,1,"NetShow for PowerPoint","/stream"'
Where:
  'A' marks this as an attribute line, 
  '1277' is the attribute ID number for an area of the website (called a Vroot),
  '1' may be ignored, 
  '"NetShow for PowerPoint"' is the title of the Vroot, 
  '"/stream"' is the URL relative to "http://www.microsoft.com"

-- Case and vote lines:
For each user, there is a case line followed by zero or more vote lines.
For example:
  C,"10164",10164
  V,1123,1
  V,1009,1
  V,1052,1
Where:
  'C' marks this as a case line, 
  '10164' is the case ID number of a user, 
  'V' marks the vote lines for this case, 
  '1123', '1009', '1052' are the attribute IDs of the Vroots that the user visited, 
  '1' may be ignored.
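
The attribute, case, and vote lines described above can be read with a short Python sketch like the one below. It is not an official reader for the format: the function name parse_dst and the returned dictionary layout are assumptions, and it relies only on the fields documented here. Lines starting with any other letter are skipped.

```python
import csv
import io

def parse_dst(lines):
    """Parse DST lines into (attributes, cases).

    attributes: attribute ID -> (title, url)
    cases:      case ID -> list of visited attribute IDs
    """
    attributes = {}
    cases = {}
    current_case = None
    # csv.reader handles the comma separation and the quoted fields.
    for row in csv.reader(lines):
        if not row:
            continue
        kind = row[0]
        if kind == "A":
            # A,<attribute ID>,<ignored>,<title>,<url>
            attributes[int(row[1])] = (row[3], row[4])
        elif kind == "C":
            # C,<quoted case ID>,<case ID>
            current_case = int(row[2])
            cases[current_case] = []
        elif kind == "V" and current_case is not None:
            # V,<attribute ID>,<ignored>; belongs to the preceding case line
            cases[current_case].append(int(row[1]))
    return attributes, cases

# The example lines from this description.
sample = io.StringIO(
    'A,1277,1,"NetShow for PowerPoint","/stream"\n'
    'C,"10164",10164\n'
    'V,1123,1\n'
    'V,1009,1\n'
    'V,1052,1\n'
)
attributes, cases = parse_dst(sample)
```

To read an actual data file, pass an open file object instead of the sample, for example one opened with newline="" as the csv module recommends.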

Past Usage

J. Breese, D. Heckerman, C. Kadie. Empirical Analysis of Predictive Algorithms for Collaborative Filtering. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI, July 1998.


The UCI KDD Archive
Information and Computer Science
University of California, Irvine
Irvine, CA 92697-3425
Last modified: July 12, 1999