Introduction to Data Mining
Project 1: Data Pre-Processing
In this project, students are to program data pre-processing techniques on gene expression datasets.
The dataset (P1InputData.csv) provided in the project folder contains 62 samples collected from colon-cancer patients of two classes; there are 22 positive tuples and 40 negative ones. Each tuple (row) consists of the readings for the genes and the class (which is the last column) on one biopsy. Each gene is an attribute. The columns are separated by ",". We number the genes 1 to N in the left-to-right order; we will refer to the genes using gi where i is a column number; for example the first gene (column) is called g1.
Your program should work on other datasets with similar formats but they may have different number of rows and different number of columns (perhaps also different class names). You can assume that there are exactly two classes. So your program may need a scan of the data to determine the number of genes and the number of instances/rows.
Your (compiled) program will be run using the following command-line command:
java DPP datafilename k m
Your program will perform the following two tasks.
Task 1. Discretize, rank, and select the top-k genes of the data using the entropy-based method (for 2 intervals). This task produces three files; only the k highest ranked genes (in the information gain order) will be included in these files:
(a) A file (called entropyRank.csv) containing the entropy-based ranking of the k genes, in decreasing information gain order. Each row of the file should contain a gene number, the split value determined by the entropy-based binning method for the gene, and the information gain of the split. The three values should be separated by commas.
(b) A file (called entropyItemMap.csv); each row in this file contains four comma-separated values gi, lb,rb,j, where gi is a gene ID, lb is the left bound and rb is right bound of an interval/bin, and j is the integer to be used to represent the interval in the itemized data.
(c) A file (called entropyItemizedData.txt) containing the itemized data for the top-k genes. Each row in this file corresponds to a row in the input data. Use commas to separate items in each row. The first item of the rows should be about the highest ranked gene, the second item should be about the second highest ranked gene, and so on.
Task 2. Discretize the top-k genes selected in Task 1, using the equi-density binning method into m intervals. This task will produce two files:
(a) A file (called equiDensityItemMap.csv); each row of this file contains four comma-separated values gi, lb,rb,j, where gi is a gene ID, lb is the left bound and rb is right bound of an interval, and j is the integer to be used to represent the interval in the itemized data.
(d) (b) A file (called equiDensityItemizedData.txt) containing the itemized data for the top-k genes. Again the first item of the rows should be about the highest entropy-ranked gene, the second item should be about the second highest entropy-ranked gene, and so on.
Slide 46 of 03Preprocessing.ppt contains an example dataset, a discretization map, and an itemized dataset. The dataset you will work with does not contain the row for gene names.
You need to write your program in Java. When complied your program will produce an executable program called DPP. The program should be able to do the two tasks described above using a command of the form “java DPP datafilename 3 4” when it is run with the “datafilename” file in the same folder where the program is. The “3” is the k (number of genes for entropy-based ranking) and the “4” is the m (number of intervals for equidensity).
Introduction to Data Mining Project 1: Data Pre-Processing Java Solution
This solution has two parts : Task 1 and Task 2.
Both java programs have mentioned in solution as pdf.Beware: We are not responsible for any Plagiarism issues. Just change accordingly.