{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Homework 4 -- Name(s) here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Due Thursday, November 12 by midnight**. You may submit this assignment in groups of 2. Be sure to put your names above. Also, don't forget to submit your data files."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this homework, we approach the problem of identifying a small subset of a dataset using unsupervised and supervised methods. The dataset we'll use is a set of newsgroup postings from the early days of the internet.\n",
    "\n",
    "In a real world application, we often don’t have labels, and clustering and outlier detection are usually applied in settings that don’t have labels. For the clustering part of this homework, you should work without the ground truth labels as much as possible. Often inspecting and visualizing the data is the only way to understand the result of clustering.\n",
    "\n",
    "However, since we do have ground-truth we could do a post-hoc analysis and determine how well we actually did. You  might explore that after you've done the clustering.\n",
    "\n",
    "For tasks 1-3 you don’t need to split the data or use cross-validation. For task 4, you need to use the standard split methods for supervised learning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rec.sport.hockey            600\n",
      "soc.religion.christian      599\n",
      "rec.motorcycles             598\n",
      "rec.sport.baseball          597\n",
      "sci.crypt                   595\n",
      "sci.med                     594\n",
      "rec.autos                   594\n",
      "sci.space                   593\n",
      "comp.windows.x              593\n",
      "comp.os.ms-windows.misc     591\n",
      "sci.electronics             591\n",
      "comp.sys.ibm.pc.hardware    590\n",
      "misc.forsale                585\n",
      "comp.graphics               584\n",
      "comp.sys.mac.hardware       578\n",
      "talk.politics.mideast       564\n",
      "talk.politics.guns          546\n",
      "alt.atheism                 480\n",
      "talk.politics.misc          465\n",
      "talk.religion.misc          377\n",
      "Name: Category Name, dtype: int64\n",
      "From: lerxst@wam.umd.edu (where's my thing)\n",
      "Subject: WHAT car is this!?\n",
      "Nntp-Posting-Host: rac3.wam.umd.edu\n",
      "Organization: University of Maryland, College Park\n",
      "Lines: 15\n",
      "\n",
      " I was wondering if anyone out there could enlighten me on this car I saw\n",
      "the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
      "early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
      "the front bumper was separate from the rest of the body. This is \n",
      "all I know. If anyone can tellme a model name, engine specs, years\n",
      "of production, where this car is made, history, or whatever info you\n",
      "have on this funky looking car, please e-mail.\n",
      "\n",
      "Thanks,\n",
      "- IL\n",
      "   ---- brought to you by your neighborhood Lerxst ----\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Text</th>\n",
       "      <th>Category Label</th>\n",
       "      <th>Category Name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>From: lerxst@wam.umd.edu (where's my thing)\\nS...</td>\n",
       "      <td>7</td>\n",
       "      <td>rec.autos</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>From: guykuo@carson.u.washington.edu (Guy Kuo)...</td>\n",
       "      <td>4</td>\n",
       "      <td>comp.sys.mac.hardware</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>From: twillis@ec.ecn.purdue.edu (Thomas E Will...</td>\n",
       "      <td>4</td>\n",
       "      <td>comp.sys.mac.hardware</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>From: jgreen@amber (Joe Green)\\nSubject: Re: W...</td>\n",
       "      <td>1</td>\n",
       "      <td>comp.graphics</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>From: jcm@head-cfa.harvard.edu (Jonathan McDow...</td>\n",
       "      <td>14</td>\n",
       "      <td>sci.space</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11309</th>\n",
       "      <td>From: jim.zisfein@factory.com (Jim Zisfein) \\n...</td>\n",
       "      <td>13</td>\n",
       "      <td>sci.med</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11310</th>\n",
       "      <td>From: ebodin@pearl.tufts.edu\\nSubject: Screen ...</td>\n",
       "      <td>4</td>\n",
       "      <td>comp.sys.mac.hardware</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11311</th>\n",
       "      <td>From: westes@netcom.com (Will Estes)\\nSubject:...</td>\n",
       "      <td>3</td>\n",
       "      <td>comp.sys.ibm.pc.hardware</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11312</th>\n",
       "      <td>From: steve@hcrlgw (Steven Collins)\\nSubject: ...</td>\n",
       "      <td>1</td>\n",
       "      <td>comp.graphics</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11313</th>\n",
       "      <td>From: gunning@cco.caltech.edu (Kevin J. Gunnin...</td>\n",
       "      <td>8</td>\n",
       "      <td>rec.motorcycles</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>11314 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                    Text  Category Label  \\\n",
       "0      From: lerxst@wam.umd.edu (where's my thing)\\nS...               7   \n",
       "1      From: guykuo@carson.u.washington.edu (Guy Kuo)...               4   \n",
       "2      From: twillis@ec.ecn.purdue.edu (Thomas E Will...               4   \n",
       "3      From: jgreen@amber (Joe Green)\\nSubject: Re: W...               1   \n",
       "4      From: jcm@head-cfa.harvard.edu (Jonathan McDow...              14   \n",
       "...                                                  ...             ...   \n",
       "11309  From: jim.zisfein@factory.com (Jim Zisfein) \\n...              13   \n",
       "11310  From: ebodin@pearl.tufts.edu\\nSubject: Screen ...               4   \n",
       "11311  From: westes@netcom.com (Will Estes)\\nSubject:...               3   \n",
       "11312  From: steve@hcrlgw (Steven Collins)\\nSubject: ...               1   \n",
       "11313  From: gunning@cco.caltech.edu (Kevin J. Gunnin...               8   \n",
       "\n",
       "                  Category Name  \n",
       "0                     rec.autos  \n",
       "1         comp.sys.mac.hardware  \n",
       "2         comp.sys.mac.hardware  \n",
       "3                 comp.graphics  \n",
       "4                     sci.space  \n",
       "...                         ...  \n",
       "11309                   sci.med  \n",
       "11310     comp.sys.mac.hardware  \n",
       "11311  comp.sys.ibm.pc.hardware  \n",
       "11312             comp.graphics  \n",
       "11313           rec.motorcycles  \n",
       "\n",
       "[11314 rows x 3 columns]"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.datasets import fetch_20newsgroups\n",
    "newsgroups = fetch_20newsgroups()\n",
    "\n",
    "all_df = pd.DataFrame({'Text': newsgroups.data, 'Category Label': newsgroups.target})\n",
    "all_df['Category Name'] = all_df['Category Label'].map(lambda idx: newsgroups.target_names[idx])\n",
    "print(all_df['Category Name'].value_counts())\n",
    "print(all_df['Text'][0])\n",
    "all_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Task 0: Split the Data (2 points)\n",
    "\n",
    "Choose any 3-6 categories. Create a new dataframe containing only the rows in those categories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Task 1: Vectorization (6 points)\n",
    "\n",
    "## 1.1 Plain Tfidf\n",
    "\n",
    "Vectorize the data using TfIdf. Do not change the defaults. What are the top 10 most 'important' (highest tfidf score) words?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.2 Better Tfidf\n",
    "\n",
    "Now vectorize again but this time choose reasonable values for some of the parameters, such as stop_words. What are the top 10 most 'important' terms now?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 2: Dimensionality Reduction and Visualization (8 points)\n",
    "\n",
    "## 2.1 PCA\n",
    "\n",
    "Use PCA to reduce to 2 dimensions and plot the data. Color each point by its category. Make sure the plot has a useful legend.\n",
    "\n",
    "Do you see anything notable?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.2 T-SNE\n",
    "\n",
    "Now do the same with T-SNE. See if tuning the perplexity parameter helps obtaining a better visualization.\n",
    "\n",
    "Do you see anything notable?\n",
    "\n",
    "Note: You may want to re-run Tfidf and set the max_features parameter to some reasonable value (e.g., 100). Otherwise TSNE may take a very long time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 3: Clustering (12 points)\n",
    "\n",
    "Now we will cluster the Tfidf-vectorized data. (Do NOT cluster on the PCA- or TSNE-reduced data.) For each algorithm, try to manually tune the parameters for a reasonable outcome and document how you tuned the parameters. In particular pay attention to the sizes of the clusters created.\n",
    "\n",
    "Manually inspect the outcomes as well as you can and identify if any of the resulting clusters are meaningful (as far as you can tell).\n",
    "\n",
    "### 3.1 KMeans\n",
    "\n",
    "Use KMeans to cluster the data. Use the PCA or TSNE projections (your choice) to make another scatterplot, but this time color the points by their cluster label from KMeans.\n",
    "\n",
    "On the same plot, plot the cluster centers that KMeans found. This will mean you'll need to use your fit PCA or TSNE model to transform into 2 dimensions. Plot these with a different marker shape.\n",
    "\n",
    "For each cluster center, find the 5 closest posts and print them. Do you notice anything useful? Hint: `KMeans` has a transform method that might be useful. Hint 2: `np.argsort` is your friend."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2 Agglomerative Clustering\n",
    "\n",
    "Use Agglomerative Clustering (with ward linkage) to cluster the data. Create a dendrogram for agglomerative clustering (the truncate_mode='level' might be useful).\n",
    "\n",
    "Manually inspect the outcomes as well as you can and identify if any of the resulting clusters are meaningful (as far as you can tell)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.3 DBSCAN\n",
    "\n",
    "Use DBSCAN to cluster the data. Make a scatterplot of the results using the PCA or TSNE transformations. You don't need to worry about plotting the cluster centers in this case.\n",
    "\n",
    "Manually inspect the outcomes as well as you can and identify if any of the resulting clusters are meaningful (as far as you can tell)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.4 Subgroups\n",
    "\n",
    "Now pick one of your categories. Run Tfidf, PCA or TSNE, and one of the clustering algorithms on JUST that category. Visualize the results in some way and inspect the outcome of the clustering. Do you see any relevant groups?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Task 4: Classification (12 points)\n",
    "\n",
    "Now it's time to see if we can build a classifier.\n",
    "\n",
    "## 4.1 Baseline\n",
    "\n",
    "Pick a linear model and create a simple baseline classifier.\n",
    "\n",
    "What words/terms does the model think are most important? Try to interpret your model.\n",
    "\n",
    "Build non-linear features or derived features from a provided column. Try to improve the performance of a linear model with these."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.2 Linear Model\n",
    "\n",
    "Try to improve your baseline by trying several (2-3) different linear classifiers, making sure to carefully tune hyperparameters and use cross-validation.\n",
    "\n",
    "What words/terms does your best model think are most important? Try to interpret your model.\n",
    "\n",
    "Optional: You are welcome to add extra features you think might be useful. That might be the count of a particular word in each post (e.g., a popular misspelling of a word), whether or not a particular email address is mentioned, etc. There is not a ton of room for customization here, but there are some things to try."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.3 Non-Linear Model\n",
    "\n",
    "Now do the same as above but use non-linear models. Try out 2-3 of them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}