
Assignment 2 - CS7800 Summer 2022 Python(Solved)

Updated: Jul 24, 2022

Wright State University Department of Computer Science and Engineering

CS7800 Summer 2022 Python Assignment 2 (Solution)

In this project, you will implement several different classifiers in Python using scikit-learn APIs. The project has two phases. In Phase I, you will build classifiers and run them

on the same modified 20 Newsgroup dataset. In Phase II, you will evaluate these

classifiers to elucidate their quality of predictions (e.g., using accuracy and confusion matrices).


The project may be done in a team of two or three members like before, to promote

discussions and insights. If a fourth member is included because someone cannot find a

teammate, the team must also implement and evaluate two of the classifiers (e.g., kNN

and Rocchio) manually from scratch as required below. All the team members are

expected to contribute to all aspects of the project: design, implementation,

documentation, and testing, for their own good.

Phase I

scikit-learn provides a mature set of APIs for building models using regression,

classification and clustering techniques, and has been used extensively for prediction


Classification on Newsgroups

For this project, you will use a subset of the 20 Newsgroups dataset. The full data

set contains 20,000 newsgroup documents, partitioned (nearly) evenly across 20

different newsgroups and has been used for experiments in text applications of

machine learning techniques, such as text classification and text clustering. This

assignment dataset contains a pre-processed subset of 1000 documents and a

vocabulary (dictionary) of 5,500 terms. As you are already familiar with the text pre-processing pipeline for parsing and converting text into a sequence of terms, we are

providing the associated term-document matrix representation of the dataset as input.

Each document belongs to one of two classes: Hockey (class label 1) and Microsoft

Windows (class label 0). The data has already been split (80%, 20%) into training

data and test data. The class labels for training data and test data are also provided in

separate files. The training data and test data, in term x document format, contain a

row for each term in the vocabulary and a column for each document. The values in

the table represent raw term occurrence counts. The data has already been

preprocessed to extract tokens, remove stop words and perform stemming (so, the

terms in the vocabulary are stems, not full terms). Please be sure to read the

readme.txt file in the distribution. Your task is to exercise several different classifiers

available in scikit-learn for classification on the given 2Newsgroups dataset.

Specifically, you must use K-Nearest-Neighbor (kNN), Centroid-based Rocchio method,

Naive Bayes (NB) and Support Vector Machine (SVM) classifiers. You may additionally

use Pandas, NumPy, standard Python libraries, and Matplotlib in your programs.
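
The workflow described above can be sketched as follows. This is a minimal illustration, not the required solution: the real file names and layout are given in readme.txt, so a small synthetic term x document count matrix stands in for them here. The helper names (`make_toy_term_doc`, `fit_all`), the choice of `k=5`, and the linear-kernel `SVC` are assumptions made for the example. The key step is the transpose, because scikit-learn estimators expect one row per document, while the supplied data has one row per term.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.svm import SVC


def make_toy_term_doc(n_terms=50, n_docs=40, seed=0):
    """Build a synthetic term x document count matrix plus 0/1 labels.

    Hypothetical stand-in for the real data files described in readme.txt.
    """
    rng = np.random.default_rng(seed)
    labels = np.arange(n_docs) % 2              # alternate class 0 / class 1
    rates = np.full((n_terms, n_docs), 0.2)     # low background term rate
    half = n_terms // 2
    rates[:half, labels == 1] = 3.0             # class 1 favours the first half of the vocabulary
    rates[half:, labels == 0] = 3.0             # class 0 favours the second half
    return rng.poisson(rates), labels


def fit_all(term_doc_train, y_train):
    """Fit the four classifier families on a term x document matrix."""
    X_train = term_doc_train.T                  # documents become rows
    models = {
        "kNN": KNeighborsClassifier(n_neighbors=5),
        "Rocchio": NearestCentroid(),           # centroid-based classifier
        "NB": MultinomialNB(),                  # suits raw term counts
        "SVM": SVC(kernel="linear"),
    }
    for model in models.values():
        model.fit(X_train, y_train)
    return models


term_doc, labels = make_toy_term_doc()
# Mimic the 80% / 20% split: documents are columns, so split by column.
train_td, test_td = term_doc[:, :32], term_doc[:, 32:]
y_train, y_test = labels[:32], labels[32:]

models = fit_all(train_td, y_train)
predictions = {name: m.predict(test_td.T) for name, m in models.items()}
```

With the real data, only the loading step should need to change; the transpose and the fit/predict calls stay the same.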

Milestones for Phase I

• Read and understand the input dataset format.

• Implement a program to exercise the different classifiers given.

• For a four-member team: implement K-Nearest-Neighbor (kNN) and Centroid-based Rocchio classifiers from scratch (that is, without using scikit-learn APIs)

in Python.
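
For the from-scratch milestone above, one possible starting point is sketched below using only NumPy. It assumes documents are rows, labels are 0/1, and Euclidean distance on raw counts is acceptable; the assignment does not fix the distance measure, and Rocchio is often run on length-normalized or tf-idf vectors instead. The function names are hypothetical.

```python
import numpy as np


def knn_predict(X_train, y_train, X_test, k=3):
    """Majority vote among the k nearest training documents (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b,
    # which avoids building an (n_test, n_train, n_terms) array.
    d2 = (
        (X_test ** 2).sum(axis=1)[:, None]
        + (X_train ** 2).sum(axis=1)[None, :]
        - 2.0 * X_test @ X_train.T
    )
    nearest = np.argsort(d2, axis=1, kind="stable")[:, :k]
    votes = y_train[nearest]                    # labels of the k neighbours
    # Assumes 0/1 labels: predict 1 when more than half of the votes are 1.
    return (votes.sum(axis=1) * 2 > k).astype(int)


def rocchio_fit(X_train, y_train):
    """Return the sorted class labels and one centroid (mean vector) per class."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    centroids = np.vstack([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes, centroids


def rocchio_predict(classes, centroids, X_test):
    """Assign each test document to the class with the nearest centroid."""
    X_test = np.asarray(X_test, dtype=float)
    d2 = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]


# Tiny hand-made example: three documents per class, three terms.
X_train = np.array([[5, 0, 1], [4, 1, 0], [6, 0, 0],
                    [0, 5, 1], [1, 4, 0], [0, 6, 2]])
y_train = np.array([1, 1, 1, 0, 0, 0])
X_test = np.array([[5, 1, 0], [0, 4, 1]])

knn_out = knn_predict(X_train, y_train, X_test, k=3)      # [1, 0]
classes, centroids = rocchio_fit(X_train, y_train)
rocchio_out = rocchio_predict(classes, centroids, X_test)  # [1, 0]
```

An odd `k` avoids tied votes in the two-class setting.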

Phase I resources

• Dataset:

• Relevant tutorials and descriptions

• See:

• See:

• See:

Phase I environment setup

• Python 3 or higher

• Do not use Jupyter Notebook

• Install scikit-learn:

• See:

• Name your well-documented Python script as

• Students are required to use a Python virtual environment. For those unfamiliar

with Python virtual environments, here are two reference articles explaining how

to use and activate them:

• See:

• See:

• Once you finish this assignment and are ready to submit it to Pilot, you need to

create a requirements.txt file with all the libraries used in your development. Use

the following commands only after you’ve activated your Python environment.

• source venv/bin/activate

• pip list --format=freeze > requirements.txt

• deactivate

• See:
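
Since a virtual environment is required, a script can optionally check at start-up that it is running inside one. The sketch below relies on the standard `sys.prefix` / `sys.base_prefix` comparison; the function name `in_virtualenv` is hypothetical, and the handout does not ask for such a check.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a virtual environment.

    Inside a venv, sys.prefix points at the environment while sys.base_prefix
    points at the base Python installation; outside one, the two are equal.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if not in_virtualenv():
    # Warn rather than exit, so the script still runs on a grader's machine.
    print("warning: no virtual environment appears to be active", file=sys.stderr)
```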

Phase II

Briefly discuss the evaluation metrics to be used to elucidate and quantify the

quality of classifier predictions.

Milestones for Phase II

• For each classifier, at a minimum, provide a confusion matrix and accuracy.

• For a four-member team: additionally, evaluate and compare K-Nearest-Neighbor

(kNN) and Centroid-based Rocchio classifiers implemented from scratch with

those implemented using scikit-learn APIs in Python.
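
The two minimum metrics named above can be computed with `sklearn.metrics`, as in the sketch below. The label vectors are made-up illustrative values, not results from the assignment data. In scikit-learn's convention, rows of the confusion matrix are true classes and columns are predicted classes; passing `labels=[0, 1]` fixes the order to Microsoft Windows (0), then Hockey (1).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative labels only (1 = Hockey, 0 = Microsoft Windows).
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()          # binary case: [[TN, FP], [FN, TP]]
acc = accuracy_score(y_true, y_pred)

print(cm)                            # [[3 1]
                                     #  [1 3]]
print(f"accuracy = {acc:.2f}")       # accuracy = 0.75

# Accuracy is the share of correct predictions: (TP + TN) / total.
manual_acc = (tp + tn) / cm.sum()
```

The same two calls work unchanged on the output of a from-scratch classifier, which makes the side-by-side comparison straightforward.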


TURN IN: Upload one tar archive file per team that contains the following files:

• Code and accompanying documentation: Include well-documented

source code for the entire project.

• Evaluation information: Provide comparative analysis of the different classifier

performance using the chosen evaluation metrics.

• README.txt: This document should briefly explain your application, how to

launch the application, any external libraries used, which version of Python 3 was

used, all team members' names and UIDs, and any other relevant information.

The more information you provide in this document the easier it is for us to grade

and give extra credit if something does not work on our system. Additionally, all

team member names, and email addresses must be included in this document.

Application execution: Make sure that your Python program runs from the command-line. We will use the following command to execute your code: python3

  • Do not use any absolute path for input files or any data files. All paths should be

local to your working directory. We suggest testing your application on one of your

teammates' computers to make sure everything works, and you did not hardcode

anything. Points will be deducted if you use absolute paths for your input and output files.

  • Do not use Jupyter notebook, as your application must launch from the command-line

with the above given syntax.

  • Input and output files: All file(s) must be placed in your working directory, all

generated output file(s) must also be placed in your working directory, and no


  • requirements.txt: all the libraries used in your virtual Python environment.

  • TAR archive: Submit your application using the following command exactly as

written to tar up your working directory: “tar -zcvf assignment2.tgz


  • Upload the archive onto Pilot > Dropbox > Assignment 2 folder.

(Only one submission per team.) You are also expected to demo your program

to us (if necessary) and be prepared to answer questions about its design,

implementation, and comparative evaluation.
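
The requirement above that all input and output paths be local to the working directory can be enforced in code. The sketch below uses `pathlib`; the helper name `resolve_local` and the example file name are hypothetical, since the actual file names come from the dataset distribution.

```python
from pathlib import Path


def resolve_local(name, base=None):
    """Return the path for `name` under the working directory.

    Raises ValueError for absolute paths, so a hardcoded machine-specific
    location fails fast instead of breaking on someone else's computer.
    """
    base = Path.cwd() if base is None else Path(base)
    candidate = Path(name)
    if candidate.is_absolute():
        raise ValueError(f"absolute path not allowed: {name}")
    return base / candidate


# Hypothetical output file name, resolved relative to the current directory.
report_path = resolve_local("results.txt")
```

Routing every file open through one helper like this makes it easy to confirm that nothing is hardcoded before creating the archive.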

Grading Criteria

You must obtain a PASS on this assignment to PASS the course. At the minimum, your code

should compile, process some queries, and return reasonable results. A passing grade is 60%.

Assignments are designed to help you learn the course concepts and are the primary

course "homework". Corrupt files or other computer problems will not be considered a

sufficient excuse to extend a deadline. It is your responsibility to back-up your work. We

strongly suggest that you save your work to multiple locations/media to aid in the recovery of

corrupt files. If you have questions regarding the projects, you can contact the instructor or

the GTA.

Assignments that are submitted late will incur a penalty of a 25% reduction in total grade

per day the assignment is late. The project must be turned in on Pilot as described in the project

description to receive full credit. Assignments emailed to the GTA or professor will receive an immediate 25% reduction in total grade.



