Obtain a large dataset, for example SAT results, IQ test scores, statistics about Canadians, house prices (e.g., MLS), life spans, or computer prices, and apply data mining (classification, regression, clustering, and/or association rule mining) to it. Analyze the results and the knowledge discovered.

Find some simple image or speech data, apply data mining (classification or clustering) to it, and analyze the results.

Search for and study at least 5 papers on a topic that interests you, such as mining financial data, applications of clustering, or mining health data. Write a survey paper on the topic.

Compare two algorithms (such as Naive Bayes and k-NN) extensively using 20+ datasets. Study for which kinds of datasets NB performs better and for which kinds k-NN performs better. This can itself be treated as a data mining problem: find features that describe the datasets, then mine which kinds of data favor each algorithm.
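
One possible way to set up the "data mining on datasets" step is to turn each dataset into one row of a meta-dataset. The sketch below is only an illustration: the choice of descriptive features (size, dimensionality, class entropy) and the helper names meta_features and meta_row are assumptions, and the two accuracies are expected to come from your own cross-validation runs.

```python
import math
from collections import Counter


def meta_features(X, y):
    """Describe one dataset with a few simple features (an illustrative choice)."""
    counts = Counter(y)
    n = len(y)
    # Class entropy in bits: 0 for a single class, higher for balanced classes.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "n_instances": n,
        "n_features": len(X[0]),
        "n_classes": len(counts),
        "class_entropy": entropy,
    }


def meta_row(X, y, acc_nb, acc_knn):
    """Build one meta-dataset row: dataset features plus which algorithm won."""
    row = meta_features(X, y)
    if acc_nb > acc_knn:
        row["winner"] = "NB"
    elif acc_knn > acc_nb:
        row["winner"] = "kNN"
    else:
        row["winner"] = "tie"
    return row
```

With 20+ such rows, "winner" becomes the class attribute, and any classifier or rule learner can be applied to look for patterns in which algorithm does better.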

Study k-means on artificial and real-world datasets. Describe and discuss various ways to find the best value of k.
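
One common way to look for a good k is the "elbow" heuristic: run k-means for several values of k and see where the within-cluster sum of squares (inertia) stops dropping sharply. The following is a minimal sketch using only the Python standard library. The deterministic farthest-point initialization is an assumption made for reproducibility (k-means is usually started from random centers), and the toy points are made up for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centers(points, k):
    """Farthest-point initialization: deterministic, chosen here for reproducibility."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm; returns (centers, inertia)."""
    centers = init_centers(points, k)
    for _ in range(iters):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, inertia


# Artificial data: three small, well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
inertias = {k: kmeans(points, k)[1] for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

On this toy data the inertia falls steeply up to k = 3 and only slightly afterwards, which is the kind of elbow to look for. Other criteria, such as the silhouette score, can be compared against it in the same loop.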

Test and verify cost-sensitive learning. One standard method is thresholding: choose the theoretical threshold from the given FP and FN costs. Use a classifier with reasonable or good probability estimates (such as bagged decision trees), and use cross-validation to find the threshold that minimizes the weighted cost. Is this threshold the same as the theoretical value?
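
As a rough sketch of the comparison: if correct predictions are assumed to cost nothing, the theoretical threshold on the estimated probability of the positive class is C_FP / (C_FP + C_FN), and an instance is predicted positive when its probability is at or above that value. The code below computes this value and also searches for the threshold with the lowest total cost on a set of held-out predictions. The function names and the toy numbers are hypothetical; in the project, the probabilities would come from the cross-validation folds of your classifier.

```python
def theoretical_threshold(cost_fp, cost_fn):
    """Cost-minimizing threshold on P(positive), assuming correct predictions cost 0."""
    return cost_fp / (cost_fp + cost_fn)


def total_cost(probs, labels, threshold, cost_fp, cost_fn):
    """Total misclassification cost when predicting positive for p >= threshold."""
    cost = 0.0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 0:
            cost += cost_fp  # false positive
        elif pred == 0 and y == 1:
            cost += cost_fn  # false negative
    return cost


def best_empirical_threshold(probs, labels, cost_fp, cost_fn):
    """Try every distinct predicted probability (plus 'predict all negative')."""
    candidates = sorted(set(probs)) + [float("inf")]
    return min(candidates, key=lambda t: total_cost(probs, labels, t, cost_fp, cost_fn))


# Toy held-out predictions (made-up numbers): P(positive) and true labels.
probs = [0.1, 0.3, 0.6, 0.9]
labels = [0, 1, 0, 1]
print(theoretical_threshold(1, 4))                    # 0.2
print(best_empirical_threshold(probs, labels, 1, 4))  # 0.3
```

In this toy example the empirical threshold (0.3) differs from the theoretical one (0.2), although both give the same total cost here. With real data, a large gap between the two usually suggests that the probability estimates are poorly calibrated.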

Another popular data-mining tool is RapidMiner (others include R, SAS, ...). Compare WEKA and RapidMiner in terms of functionality, usage, the algorithms provided, etc.