This data set consists of 20,000 messages taken from 20 newsgroups.
Tom Mitchell, School of Computer Science, Carnegie Mellon University, firstname.lastname@example.org
Date Donated: September 9, 1999
One thousand Usenet articles were taken from each of the following 20 newsgroups.
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc
Approximately 4% of the articles are crossposted. The articles are typical postings and thus contain headers (including subject lines), signature files, and quoted portions of other articles.
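Because each article keeps its headers, the header block can be separated from the body before any text processing. The following is a minimal sketch, not part of the original distribution. It assumes each file follows the usual Usenet layout (header lines, a blank line, then the body) and that a Newsgroups header listing more than one group marks a crossposted article. The function names split_article and is_crossposted are hypothetical.

```python
from email import message_from_string


def split_article(raw):
    """Split one raw article into subject, newsgroup list, and body.

    Assumes RFC 822-style headers followed by a blank line and a
    plain-text (non-multipart) body.
    """
    msg = message_from_string(raw)
    # A crossposted article lists several comma-separated groups here.
    groups_header = msg.get("Newsgroups") or ""
    groups = [g.strip() for g in groups_header.split(",") if g.strip()]
    return {
        "subject": msg.get("Subject", ""),
        "newsgroups": groups,
        "body": msg.get_payload(),
    }


def is_crossposted(raw):
    """Return True if the article names more than one newsgroup."""
    return len(split_article(raw)["newsgroups"]) > 1


# Made-up article text, used only to illustrate the layout.
sample = (
    "From: someone@example.org\n"
    "Subject: Hello\n"
    "Newsgroups: sci.space,sci.med\n"
    "\n"
    "Body text\n"
    "> a quoted line from another article\n"
)
parts = split_article(sample)
```

Signature files and quoted lines remain in the body after this step. Removing them would need further heuristics, such as dropping lines that begin with ">".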
Each newsgroup is stored in a subdirectory, with each article stored as a separate file.
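The directory layout doubles as the labeling scheme, so the data can be read by walking the subdirectories. The sketch below is one possible way to do this, under two assumptions: the archive has been unpacked so that the root directory contains one subdirectory per newsgroup, and the files can be decoded as Latin-1. The function name load_articles and the path in the usage note are hypothetical.

```python
from pathlib import Path


def load_articles(root):
    """Return (texts, labels) for a tree laid out as root/<newsgroup>/<article>."""
    texts, labels = [], []
    group_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for group_dir in group_dirs:
        for article in sorted(p for p in group_dir.iterdir() if p.is_file()):
            # Latin-1 maps every byte to a character, so decoding cannot fail;
            # the true encoding of each article is an assumption here.
            texts.append(article.read_text(encoding="latin-1"))
            # The subdirectory name serves as the class label.
            labels.append(group_dir.name)
    return texts, labels
```

For example, texts, labels = load_articles("20_newsgroups") would return one string per article and the matching newsgroup name for each, assuming the data was unpacked into a directory of that name.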
T. Mitchell (1997). Machine Learning. McGraw Hill.
T. Joachims (1996). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Computer Science Technical Report CMU-CS-96-118, Carnegie Mellon University.
Naive Bayes code for text classification is available from: http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html