20 Newsgroups

Data Type

text

Abstract

This data set consists of 20,000 messages taken from 20 Usenet newsgroups.

Sources

Original Owner and Donor

Tom Mitchell
School of Computer Science
Carnegie Mellon University
tom.mitchell@cmu.edu
Date Donated: September 9, 1999

Data Characteristics

One thousand Usenet articles were taken from each of the following 20 newsgroups.

    alt.atheism
    comp.graphics
    comp.os.ms-windows.misc
    comp.sys.ibm.pc.hardware
    comp.sys.mac.hardware
    comp.windows.x
    misc.forsale
    rec.autos
    rec.motorcycles
    rec.sport.baseball
    rec.sport.hockey
    sci.crypt
    sci.electronics
    sci.med
    sci.space
    soc.religion.christian
    talk.politics.guns
    talk.politics.mideast
    talk.politics.misc
    talk.religion.misc

Approximately 4% of the articles are crossposted. The articles are typical postings and thus contain headers (including subject lines), signature files, and quoted portions of other articles.
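
Because each article carries headers, quoted text, and a signature, some preprocessing is often useful before the text is used for classification. The following is a minimal sketch, not part of the original distribution. The function name parse_article is hypothetical, and the code assumes that articles follow common Usenet conventions: a "Newsgroups" header listing comma-separated groups, quoted lines beginning with ">", and a signature introduced by a line containing only "--". Articles that deviate from these conventions will not be cleaned correctly.

```python
from email import message_from_string


def parse_article(raw):
    """Split a raw Usenet article into header fields and a cleaned body.

    Heuristic sketch: assumes ">"-prefixed quotes and a "--" signature line.
    """
    msg = message_from_string(raw)

    # A crossposted article lists more than one group in its Newsgroups header.
    header = msg.get("Newsgroups") or ""
    groups = [g.strip() for g in header.split(",") if g.strip()]

    # Plain articles yield a string payload; anything else is treated as empty.
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""

    kept = []
    for line in body.splitlines():
        if line.rstrip() == "--":
            # Assumed signature delimiter: drop everything after it.
            break
        if line.lstrip().startswith(">"):
            # Assumed quote marker: skip text quoted from other articles.
            continue
        kept.append(line)

    return {
        "subject": msg.get("Subject", ""),
        "newsgroups": groups,
        "crossposted": len(groups) > 1,
        "body": "\n".join(kept).strip(),
    }
```

The "crossposted" flag gives one way to identify articles that appear under more than one newsgroup, which matters when the articles are split into training and test sets.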

Data Format

Each newsgroup is stored in a subdirectory, with each article stored as a separate file.
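
The directory layout described above can be read with a short loop. This is a minimal sketch under stated assumptions: the function name load_newsgroups and the root path passed to it are hypothetical, each subdirectory is assumed to be named after its newsgroup, and files are decoded as Latin-1 only because that decoding never fails on arbitrary bytes, not because the articles are known to use it.

```python
from pathlib import Path


def load_newsgroups(root):
    """Return a list of (newsgroup, article_text) pairs from a directory tree."""
    root = Path(root)
    articles = []
    # Each subdirectory of root is assumed to be named after its newsgroup.
    for group_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Each file in the subdirectory is treated as one article.
        for article_path in sorted(p for p in group_dir.iterdir() if p.is_file()):
            # Latin-1 maps every byte to a character, so decoding cannot fail.
            text = article_path.read_text(encoding="latin-1")
            articles.append((group_dir.name, text))
    return articles
```

The subdirectory name serves as the class label, so the returned pairs can be passed directly to a text classifier once the article text has been tokenized.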

Past Usage

T. Mitchell (1997). Machine Learning. McGraw Hill.

T. Joachims (1996). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Computer Science Technical Report CMU-CS-96-118, Carnegie Mellon University.

Acknowledgements, Copyright Information, and Availability

You may use this material free of charge for any educational purpose, provided attribution is given in any lectures or publications that make use of this material.

References and Further Information

Naive Bayes code for text classification is available from: http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html


The UCI KDD Archive
Information and Computer Science
University of California, Irvine
Irvine, CA 92697-3425
Last modified: September 9, 1999