Assignment 3 of CS412A/555A, 2011

Due date: Nov 8 (midnight)
Submission: email submission ONLY
Individual effort (no group work)
Total marks: 10% of the final marks

1. Select at least three large regression datasets (over 500 instances each; the output is numeric rather than a binary or discrete class) from the UCI Machine Learning Repository or from the datasets on the WEKA website.

Apply the following WEKA regression algorithms to these datasets.

Linear regression, under classifiers/functions/LinearRegression
Multilayer perceptrons (neural networks), under classifiers/functions/MultilayerPerceptron
Distance-weighted k-NN, under classifiers/lazy/IBk
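
To make the third algorithm concrete, here is a minimal sketch of distance-weighted k-NN regression in Python. It is not WEKA's IBk implementation; it only illustrates the idea of weighting each neighbour by the inverse of its distance. The function name knn_predict, the Euclidean distance, and the small constant eps are assumptions made for this illustration.

```python
import math

def knn_predict(train, query, k=3, eps=1e-9):
    """Predict a numeric target for `query` from its k nearest neighbours.

    train: list of (features, target) pairs, features being a list of floats.
    query: list of floats with the same length as each feature list.
    """
    # Rank training points by Euclidean distance to the query and keep k.
    neighbours = sorted((math.dist(x, query), y) for x, y in train)[:k]
    # Weight each neighbour by 1/distance; eps (an assumed small constant)
    # avoids division by zero when the query matches a training point.
    weights = [1.0 / (d + eps) for d, _ in neighbours]
    # The prediction is the weighted average of the neighbours' targets.
    return sum(w * y for w, (_, y) in zip(weights, neighbours)) / sum(weights)
```

With plain (unweighted) k-NN the prediction would be the simple mean of the k targets; here closer neighbours dominate, and an exact match effectively returns its own target.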

Show what you have done and analyze the results (comparing the total error, speed of the algorithms, etc.).
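
When comparing errors, it helps to know how the figures are defined. WEKA's evaluation output for regression includes the mean absolute error and the root mean squared error; the sketch below shows how these two are computed from actual and predicted values. The function names mae and rmse are chosen for this illustration only.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: the average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: the square root of the mean squared difference."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))
```

Because RMSE squares each difference before averaging, a few large mistakes raise it much more than they raise MAE, so the two measures can rank algorithms differently on the same dataset.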

2. Generate two synthetic 2D datasets (with at least 1,000 data points each): one on which you expect the k-means clustering algorithm to work well, and one on which you expect it to fail. Use SimpleKMeans in WEKA to cluster each dataset several times, trying different values of k, and visualize the results to see whether SimpleKMeans really works. State whether the results are consistent with your original hypothesis, and analyze why SimpleKMeans works well or fails on each dataset.

Note: To generate the synthetic datasets, you can use a tool such as MATLAB, or you can write a program in whatever language you prefer.
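
As one possible starting point, the sketch below generates two 2D datasets in Python and writes them in ARFF format so that WEKA can load them. The shapes (Gaussian blobs and concentric rings), all parameter values, the file names, and the function names are assumptions made for illustration; they are only one of many possible choices, and you should form and justify your own hypothesis.

```python
import math
import random

def gaussian_blobs(n, centers, sd, rng):
    """n points drawn in turn from round Gaussian clusters at `centers`."""
    points = []
    for i in range(n):
        cx, cy = centers[i % len(centers)]
        points.append((rng.gauss(cx, sd), rng.gauss(cy, sd)))
    return points

def concentric_rings(n, radii, noise, rng):
    """n points spread over noisy circles that share the origin as centre."""
    points = []
    for i in range(n):
        r = radii[i % len(radii)] + rng.gauss(0, noise)
        theta = rng.uniform(0, 2 * math.pi)
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

def write_arff(path, relation, points):
    """Write 2D points as an ARFF file with numeric attributes x and y."""
    with open(path, "w") as f:
        f.write(f"@relation {relation}\n\n")
        f.write("@attribute x numeric\n@attribute y numeric\n\n@data\n")
        for x, y in points:
            f.write(f"{x:.4f},{y:.4f}\n")

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the datasets are reproducible
    blobs = gaussian_blobs(1000, [(0, 0), (10, 0), (5, 8)], 1.0, rng)
    rings = concentric_rings(1000, [2.0, 8.0], 0.3, rng)
    write_arff("blobs.arff", "blobs", blobs)
    write_arff("rings.arff", "rings", rings)
```

The reasoning behind this pair is that k-means assigns each point to the nearest centroid, which suits compact, well-separated clusters, while rings with a common centre cannot be separated that way. Whether WEKA's results actually bear this out is what the assignment asks you to check.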

Please submit files containing the two synthetic datasets, the code that generates the data, visualizations of your clustering results in WEKA, and your analysis of the results.