Guidelines for Documenting Databases: Task Information

The purpose of this page is to summarize the methods and results by the author and others in the literature for the specific task on the indicated data set. For example, a task page might summarize the relevant work that has been published to date on predicting the Dow Jones index at a daily level.

When filling out this form, simply place your answer after the point indicated by '>'. We will then process the form to ensure that all documentation files follow a common format.

1. Database Used
   -- Indicate the corresponding database for this task.

   > Anonymous web data from www.microsoft.com

2. Task Type
   -- Indicate the task: (association rules, classification, clustering, control, density estimation, exploratory data analysis, image/spatial modelling, regression, retrieval, time series prediction)
   -- If the task is not listed above, please describe it.

   > Classification, Collaborative Filtering

3. Source
   (a) Donor of task information (name/snail address/phone/email/homepage)

   > Jack S. Breese, David Heckerman, Carl M. Kadie
     Microsoft Research, Redmond WA, 98052-6399, USA
     breese@microsoft.com, heckerma@microsoft.com, carlk@microsoft.com

4. Problem Description
   -- Provide a detailed description of the data analysis problem. The description should answer the following questions:

   (a) What is the data analysis task?

   > Predicting what areas of www.microsoft.com a user visited based on data on what other areas he or she visited.

   (b) What are the criteria and constraints for judging the quality of solutions (e.g. minimize loss, comprehensibility, response time, etc.)?

   > Predictive accuracy
     Learning time
     Speed of predictions

5. Preprocessing and Modifications
   -- Describe any additional preprocessing or modifications of the original data (i.e. data already in the archive) for this analysis.

   > None

6. Other Relevant Information
   -- Include any additional information that the researcher may find useful.
   For example:
   (a) Suggested Experimental Procedure
       -- is there a suggested experimental procedure to evaluate algorithms?
       -- are there recommended train/tune/test sets?
       -- are there variables (features, attributes) that should not be used for prediction and are for information purposes only?
   (b) Cost information (if applicable/available)
       -- e.g. loss matrix for misclassification errors
   (c) Other miscellaneous information
       -- e.g. Are there well known physical or theoretical models for the process or for individual variables?

   > Experimental procedures are described in:
     J. Breese, D. Heckerman, C. Kadie. _Empirical Analysis of Predictive Algorithms for Collaborative Filtering_. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI, July, 1998.
     The training and test sets used in this paper are provided as 'anonymous-mswebtrain.dst' and 'anonymous-mswebtest.dst'.

7. Results
   -- Include references and a brief summary of key papers that report results on this dataset. Each entry should include:
   (a) The complete reference of the article where it was described/used (with a link to an online version if possible)
   (b) The study's purpose: for example, did the paper introduce a new algorithm, or present a comparison of several approaches.
       -- Briefly describe the algorithms used. Indicate the types of model structures used, as well as the fitting procedure. For example, the model structure could be a 1-hidden layer neural network trained with backpropagation.
       -- Indicate if any special data structures were used to organize the data (e.g., B*-trees, etc).
   (c) The major findings: for example, which algorithms worked well or poorly.

   > Results for this dataset are reported in:
     J. Breese, D. Heckerman, C. Kadie. _Empirical Analysis of Predictive Algorithms for Collaborative Filtering_. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI, July, 1998.
     This paper presents a comparison of a number of memory-based methods (correlation and vector similarity techniques) and model-based methods (cluster models and Bayesian networks). In terms of predictive accuracy, the results indicate that the authors' Bayesian network approach to collaborative filtering is the best-performing approach on this dataset.

8. References & Further Information
   -- Include here references to additional information that focuses on the analysis of the data. (Note there is another document for references that describe the data itself.)
   (a) pointers to tutorial/background information
   (b) other useful web sites (parent archives, domain specific sites)
   (d) online documentation or papers
   (e) other relevant publications
   (f) any additional comments on this dataset

   > Results on this dataset were expanded as Microsoft Research Technical Report MSR-TR-98-12.
     The papers are available on-line at: http://research.microsoft.com/users/breese/cfalgs.html
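
Illustrative note on the task in item 4(a) and the memory-based "vector similarity" family named in item 7: the following is a minimal sketch only, not the procedure or the weighting scheme from the cited paper. It assumes each user is represented as a plain set of visited area identifiers; the function names (`cosine_similarity`, `score_areas`) and the area labels ("A" through "E") are hypothetical and do not reflect the archive's file format. Unvisited areas are ranked for an active user by summing the cosine similarity of every training user who visited them.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two users given as sets of visited areas.

    With binary visit vectors, the dot product is the size of the
    intersection and each vector norm is sqrt(number of visited areas).
    """
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def score_areas(active, training_users):
    """Rank areas the active user has not visited (assumed scoring rule).

    Each candidate area receives the summed similarity of the training
    users who visited it. Ties are broken by area name for determinism.
    """
    scores = {}
    for other in training_users:
        weight = cosine_similarity(active, other)
        if weight == 0.0:
            continue  # users with no overlap contribute nothing
        for area in other - active:
            scores[area] = scores.get(area, 0.0) + weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy data: three training users and one active user.
training_users = [
    {"A", "B", "C"},
    {"A", "B", "D"},
    {"C", "E"},
]
ranked = score_areas({"A", "B"}, training_users)
for area, score in ranked:
    print(area, round(score, 4))
```

In this toy data the third training user shares no areas with the active user, so area "E" is never scored. A real evaluation would follow the experimental procedure and the train/test split described in the cited paper.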