Data Mining

This coursework must be completed in SAS 9.2. I will send the raw data as an email attachment, together with an e-book that should be helpful. The Pearson International Edition of Introduction to Data Mining will also be helpful; a picture of the book will be in the email. I am in the UK, but the country selection does not include the UK. I must achieve a first-class mark.

Problem Description

The objective of this coursework is to develop a suitable model to classify the German credit dataset. Submit a report of maximum length 20 pages that documents the process you adopted and the reasons why, as well as the obtained results.

Focus your answers on explaining and rationalising the process you followed (why you made the different choices and how you evaluated different alternatives). This coursework is not meant to assess your skills in using SAS EM, but rather your understanding of the data mining process. Therefore, details of how different tasks are performed in SAS EM are irrelevant. It is also insufficient to report the outcome from executing different SAS EM nodes without commenting upon it and explaining the implications for the problem at hand.

Without partitioning the dataset, perform an exploratory analysis of the data using visualisation and statistical analysis tools. In particular:

- Consider the distribution of the target variable. Are the two classes balanced in the dataset? Given the nature of the task and the distribution of the target variable, which performance evaluation measures would you propose to use for this type of problem? Rationalise your answers.

- Explore the distribution of each independent variable, and comment whether and why a transformation should be considered. Using visualisation methods like histograms, as well as measures like information value and weights of evidence, quantify the predictive power of each variable separately.
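As a rough illustration of the measures named above, the following Python sketch computes the class distribution, the weight of evidence (WoE) of each category, and the information value (IV) of one categorical variable. It is not SAS EM output and is not part of the brief. The function names, the toy data, the smoothing constant, and the sign convention WoE = ln(share of goods / share of bads) are all assumptions made for this example; some texts define WoE with the ratio inverted.

```python
import math
from collections import Counter


def class_distribution(targets):
    """Return the relative frequency of each class label."""
    counts = Counter(targets)
    total = len(targets)
    return {label: count / total for label, count in counts.items()}


def woe_iv(values, targets, good_label=0, smoothing=0.5):
    """Return ({category: WoE}, IV) for one categorical variable.

    Assumed convention: WoE = ln(share of goods / share of bads).
    `smoothing` is a hypothetical additive constant that keeps the
    logarithm finite when a category contains only one class.
    """
    good, bad = Counter(), Counter()
    for value, target in zip(values, targets):
        if target == good_label:
            good[value] += 1
        else:
            bad[value] += 1

    categories = sorted(set(good) | set(bad))
    total_good = sum(good.values()) + smoothing * len(categories)
    total_bad = sum(bad.values()) + smoothing * len(categories)

    woe, iv = {}, 0.0
    for category in categories:
        p_good = (good[category] + smoothing) / total_good
        p_bad = (bad[category] + smoothing) / total_bad
        woe[category] = math.log(p_good / p_bad)
        # Each term is non-negative, so IV grows with class separation.
        iv += (p_good - p_bad) * woe[category]
    return woe, iv


# Toy data, invented for illustration only (not the German credit data).
toy_values = ["a", "a", "a", "b", "a", "b", "b", "b"]
toy_targets = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = good, 1 = bad (assumed coding)

toy_woe, toy_iv = woe_iv(toy_values, toy_targets)
print(class_distribution(toy_targets))
print(toy_woe, toy_iv)
```

For continuous variables the same calculation would apply after binning, and the choice of bins is itself a decision that the report would need to justify.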

Assign 30% of the data as test set, 50% as training, and 20% as validation.

To ensure your analysis is unique, set your own seed at the Data Partition stage of your analysis. Use the number 38099 as the seed.
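The split described above can be sketched outside SAS EM as follows. This is a minimal Python illustration of a seeded 50/20/30 random partition, not a reproduction of the Data Partition node: the same seed will not produce the same rows in SAS, the function name is hypothetical, and the row count of 1000 is an assumption about the dataset rather than something stated in the brief.

```python
import random


def partition_indices(n_rows, seed=38099, train=0.5, valid=0.2):
    """Shuffle row indices with a fixed seed and cut them into
    training, validation, and test subsets (test gets the remainder)."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    indices = list(range(n_rows))
    rng.shuffle(indices)

    n_train = round(n_rows * train)
    n_valid = round(n_rows * valid)
    train_idx = indices[:n_train]
    valid_idx = indices[n_train:n_train + n_valid]
    test_idx = indices[n_train + n_valid:]
    return train_idx, valid_idx, test_idx


# Assumed row count, for illustration only.
train_idx, valid_idx, test_idx = partition_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

This sketch uses simple random sampling. If the target classes turn out to be unbalanced, sampling within each class (stratification) is one option to consider so that all three subsets keep similar class proportions.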

Use variable selection methods as well as logistic regression to determine an appropriate subset of variables that can be used to build a classifier for this problem. Comment on the limitations of the different approaches and discuss your findings. For which of the classification methods that you will develop later do you expect the proposed subset to be most relevant, and why?

Develop and evaluate a classification model for each of the methods that you have been taught. For each method identify up to 3 parameters of the method that you believe are important for the final performance of a model and consider different settings. Report your findings and comment on the sensitivity of performance with respect to each parameter setting. Reach a final recommendation for a classification model for each method. Discuss the strengths and weaknesses of different classifiers. (This should not just repeat the material from the course. Try to relate this to your experience from building a classification model for this problem.) Justify
