{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Homework 4 -- Name(s) here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Due Thursday, November 12 by midnight**. You may submit this assignment in groups of 2. Be sure to put your names above. Also, don't forget to submit your data files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this homework, we approach the problem of identifying a small subset of a dataset using unsupervised and supervised methods. The dataset we'll use is a set of newsgroup postings from the early days of the internet.\n", "\n", "In real-world applications, we often don’t have labels, and clustering and outlier detection are usually applied in exactly those settings. For the clustering part of this homework, you should work without the ground-truth labels as much as possible. Often, inspecting and visualizing the data is the only way to understand the result of clustering.\n", "\n", "However, since we do have ground-truth labels, we can do a post-hoc analysis to determine how well we actually did. You might explore that after you've done the clustering.\n", "\n", "For tasks 1-3, you don’t need to split the data or use cross-validation. For task 4, you need to use the standard split methods for supervised learning." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rec.sport.hockey 600\n", "soc.religion.christian 599\n", "rec.motorcycles 598\n", "rec.sport.baseball 597\n", "sci.crypt 595\n", "sci.med 594\n", "rec.autos 594\n", "sci.space 593\n", "comp.windows.x 593\n", "comp.os.ms-windows.misc 591\n", "sci.electronics 591\n", "comp.sys.ibm.pc.hardware 590\n", "misc.forsale 585\n", "comp.graphics 584\n", "comp.sys.mac.hardware 578\n", "talk.politics.mideast 564\n", "talk.politics.guns 546\n", "alt.atheism 480\n", "talk.politics.misc 465\n", "talk.religion.misc 377\n", "Name: Category Name, dtype: int64\n", "From: lerxst@wam.umd.edu (where's my thing)\n", "Subject: WHAT car is this!?\n", "Nntp-Posting-Host: rac3.wam.umd.edu\n", "Organization: University of Maryland, College Park\n", "Lines: 15\n", "\n", " I was wondering if anyone out there could enlighten me on this car I saw\n", "the other day. It was a 2-door sports car, looked to be from the late 60s/\n", "early 70s. It was called a Bricklin. The doors were really small. In addition,\n", "the front bumper was separate from the rest of the body. This is \n", "all I know. If anyone can tellme a model name, engine specs, years\n", "of production, where this car is made, history, or whatever info you\n", "have on this funky looking car, please e-mail.\n", "\n", "Thanks,\n", "- IL\n", " ---- brought to you by your neighborhood Lerxst ----\n", "\n", "\n", "\n", "\n", "\n" ] }, { "data": { "text/html": [ "
\n",
"|  | Text | Category Label | Category Name |\n",
"|---|---|---|---|\n",
"| 0 | From: lerxst@wam.umd.edu (where's my thing)\\nS... | 7 | rec.autos |\n",
"| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |\n",
"| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | comp.sys.mac.hardware |\n",
"| 3 | From: jgreen@amber (Joe Green)\\nSubject: Re: W... | 1 | comp.graphics |\n",
"| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | sci.space |\n",
"| ... | ... | ... | ... |\n",
"| 11309 | From: jim.zisfein@factory.com (Jim Zisfein) \\n... | 13 | sci.med |\n",
"| 11310 | From: ebodin@pearl.tufts.edu\\nSubject: Screen ... | 4 | comp.sys.mac.hardware |\n",
"| 11311 | From: westes@netcom.com (Will Estes)\\nSubject:... | 3 | comp.sys.ibm.pc.hardware |\n",
"| 11312 | From: steve@hcrlgw (Steven Collins)\\nSubject: ... | 1 | comp.graphics |\n",
"| 11313 | From: gunning@cco.caltech.edu (Kevin J. Gunnin... | 8 | rec.motorcycles |\n",
"\n",
"11314 rows × 3 columns\n", "