This data set consists of 20,000 messages taken from 20 newsgroups.
Tom Mitchell, School of Computer Science, Carnegie Mellon University, tom.mitchell@cmu.edu
Date Donated: September 9, 1999
One thousand Usenet articles were taken from each of the following 20 newsgroups.
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc
Approximately 4% of the articles are crossposted. The articles are typical postings and thus contain headers (including subject lines), signature files, and quoted portions of other articles.
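Because each article keeps its headers, a reader may want to separate them from the body before further processing. The following is a minimal sketch, not part of the original distribution. It assumes that each file starts with mail-style headers, that a blank line separates them from the body, and that a Newsgroups header lists the groups an article was posted to. The function name split_article is hypothetical.

```python
from email.parser import HeaderParser


def split_article(raw):
    """Split one raw article into selected headers and its body.

    Assumes mail-style headers followed by a blank line, and a
    comma-separated 'Newsgroups' header (an assumption about the files).
    """
    # Everything before the first blank line is treated as the header block.
    header_text, _, body = raw.partition("\n\n")
    headers = HeaderParser().parsestr(header_text)

    # A crossposted article names more than one group in 'Newsgroups'.
    groups = [g.strip() for g in (headers.get("Newsgroups") or "").split(",")]
    groups = [g for g in groups if g]

    return {
        "subject": headers.get("Subject", ""),
        "newsgroups": groups,
        "crossposted": len(groups) > 1,
        "body": body,
    }
```

Signature files and quoted portions remain in the returned body. Removing them requires heuristics (for example, dropping lines that start with ">") that this sketch does not attempt.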
Each newsgroup is stored in a subdirectory, with each article stored as a separate file.
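The directory layout described above maps directly onto labeled examples: the subdirectory name is the class label and each file is one document. The following is a minimal sketch of reading such a tree. The function name load_newsgroups and the root path argument are hypothetical, and the latin-1 encoding is an assumption chosen because it decodes any byte sequence without error.

```python
from pathlib import Path


def load_newsgroups(root):
    """Read every article under root, one subdirectory per newsgroup.

    Returns two parallel lists: article texts and newsgroup labels.
    """
    texts, labels = [], []
    for group_dir in sorted(Path(root).iterdir()):
        if not group_dir.is_dir():
            continue  # skip stray files at the top level
        for article in sorted(group_dir.iterdir()):
            if article.is_file():
                # latin-1 is an assumption; it never raises on decode.
                texts.append(article.read_text(encoding="latin-1"))
                labels.append(group_dir.name)
    return texts, labels
```

Calling load_newsgroups on the unpacked archive directory would return one text and one label per article, which can then be passed to any text classifier.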
T. Mitchell. Machine Learning, McGraw Hill, 1997.
T. Joachims (1996). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, Computer Science Technical Report CMU-CS-96-118. Carnegie Mellon University.
Naive Bayes code for text classification is available from: http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html