K6312: INFORMATION MINING & ANALYSIS
Academic Year 2012-2013, Semester 1
Course Description
This course examines the main techniques used for extracting information from large amounts of numerical data, as well as the broader process of knowledge discovery. The information extracted in data mining is often in the form of associations, patterns or specific facts hidden in the mass of data. The extracted information can be synthesized to create new knowledge, or used to construct prediction models or complex knowledge structures. The information/knowledge extracted should be useful in a tangible way to increase the profitability, efficiency, effectiveness or competitive advantage of an organization.
Data mining techniques can be divided into three types:
- Statistical techniques, including hypothesis testing, Chi-square test, t-test, multiple regression analysis and logistic regression
- Machine learning, including clustering, K-nearest neighbour, decision-tree induction, association rule induction, Bayesian learning and neural networks
- Natural language processing, including lexical, syntactic and semantic analysis.
Application of data mining in customer relationship management will be examined in the course.
Unfortunately, in this course there is not enough time to look into text mining and natural language processing. A separate course on this topic is being planned. Time series analysis, data warehousing and On-Line Analytical Processing (OLAP) will also not be covered this semester.
The class will be run partly in a project-based/problem-based style. Lab sessions and the class project will give students hands-on experience with data mining software. The course takes a how-to-do-it, how-does-it-work and how-to-apply-it approach, with a minimum of mathematics. It seeks to develop the student’s commonsense ability to examine data from different angles and to apply data mining techniques intelligently to tease out useful patterns in the data.
Objectives
At the end of the course, students are expected to:
- Understand the principles and concepts underlying the main data mining techniques, and their strengths and limitations;
- Apply data mining techniques and the knowledge discovery process to discover hidden information in numerical data;
- Understand the different kinds of patterns and models that can be extracted from a data set, and be able to select and use an appropriate technique for each type of pattern and model;
- Be able to interpret and evaluate the results of data mining;
- Describe how data mining can be used in real-life applications;
- Know the main features and functionalities that a good data mining and text mining tool should have.
Prerequisites
Students must have taken at least one semester of statistical analysis at the undergraduate level, and be comfortable analyzing data using a spreadsheet programme.
This course does not require a lot of reading. However, students must be prepared to spend sufficient time thinking and grappling with new data analysis concepts, and analyzing datasets.
The data analysis software used in the course (SPSS Statistical Software & SPSS Clementine) is available only in the school labs and cannot be loaned to students. Students must be prepared to come to school and work in the lab to complete the assignments.
Attendance is compulsory in the first half of the semester (up to the mid-semester break). You’re likely to do poorly in the final exam if you miss even a single class.
Method of Assessment
Grading
- Class attendance, participation, homework: 5% of final grade
- Group presentation: 5%
- Mid-term test: 10%
- Assignment 1: Statistical analysis of a dataset – 15%
- Assignment 2: Analysis of a dataset using machine-learning techniques – 15%
- Final exam – 50% (closed book exam)
Assignment 1. Statistical analysis exercise
- Carry out statistical analyses on a data set to be provided by the instructor.
- Do a group presentation for either Assignment 1 or Assignment 2.
- Write and submit a report of 8-10 pages, excluding appendices.
Assignment 2. Analysis using machine-learning
- Analyze the same data set using machine learning techniques and compare the results with the results from the statistical analyses.
- Do a group presentation, if you haven’t done so for Assignment 1.
- Write and submit a report of 8-10 pages, excluding appendices.
Readings
Recommended Texts
[Code] Book title
[Linoff] Linoff, G.S., & Berry, M.J.A. (2011). Data mining techniques for marketing, sales, and customer relationship management (3rd ed.). Indianapolis: Wiley. (ISBN 978-0-470-65093-6)
[SPSS] Carver, R.H., & Nash, J.G. (latest edition). Doing data analysis with SPSS. Belmont, CA: Brooks/Cole.
Other useful texts
[Tan] Tan, Pang-Ning, Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Boston: Pearson/Addison Wesley. (ISBN 0-321-32136-7)
[Witten] Witten, I.H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). Amsterdam: Elsevier.
[Westphal] Westphal, C.R. (1998). Data mining solutions: Methods and tools for solving real-world problems. New York: Wiley.
[Manning] Manning, C.D., & Schutze, H. (2000). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
[Mitchell] Mitchell, T.M. (1997). Machine learning. New York: McGraw-Hill.
[Patterson] Patterson, D.W. (1996). Artificial neural networks: Theory and applications. Singapore: Prentice Hall.
[Nisbet] Nisbet, R., Elder, J., & Miner, G. (2009). Handbook of statistical analysis and data mining applications. Amsterdam: Elsevier. (ISBN 978-0-12-374765-5)
[Tabachnick] Tabachnick, B.G., & Fidell, L.S. (2001). Using multivariate statistics (4th ed.). Boston: Allyn and Bacon.
[Lattin] Lattin, J.M., Carroll, J.D., & Green, P.E. (2003). Analyzing multivariate data. Pacific Grove, CA: Brooks/Cole – Thomson Learning.
Lecture Schedule
Week 1
Tue, 15 Aug
Lecture: Introduction; Data cleaning & preparation; Introduction to survey data set for Assignments 1 & 2
Lab: Basic data analysis with MS Excel
Readings: [Linoff] chap. 1 & 2
Week 2
Tue, 22 Aug
Lecture: Basic statistical analysis
Lab: Basic statistical analysis with SPSS
Readings: [Linoff] chap. 4; [SPSS] chap. 8-12
Week 3
Tue, 29 Aug
Lecture: Basic statistical analysis, part 2
Lab: SPSS, part 2
Readings: [Linoff] chap. 3; [SPSS] chap. 20
Week 4
Tue, 5 Sep
No class – Chris is away attending the ISIC2012 conference in Tokyo
Readings: [Linoff] chap. 5
Week 5
Tue, 12 Sep
Lecture: Regression analysis
Lab: SPSS: Regression analysis
Readings: [Linoff] chap. 6; [SPSS] chap. 15-17; Handout
Week 6
Tue, 19 Sep
Lecture: Logistic regression
Lab: SPSS: Cross-tabulation, Chi-square test of independence, Logistic regression
Readings: [Linoff] chap. 18-19; Handout
Week 7
Tue, 26 Sep
Lecture: Principal components analysis
Lab: SPSS: Principal components analysis
Readings: [Linoff] chap. 20; Handout
Recess
Tue, 3 Oct
Make-up Class
Lecture: Association rule mining
Lab: Introduction to data mining software SPSS Clementine
Lab: SPSS Clementine: Association rule mining
Readings: [Linoff] chap. 15
Week 8
Tue, 10 Oct
Mid-term test
Assignment 1 (statistical analysis) due
Week 9
Tue, 17 Oct
Lecture: Decision tree induction
Lab: SPSS Clementine: Decision tree induction
Readings: [Linoff] chap. 7
Week 10
Tue, 24 Oct
Lecture: Evaluation of classifiers
Lab: SPSS Clementine: Evaluation & comparison of models
Readings: [Linoff] chap. 2 (pp. 47-54), chap. 5 step 8
Week 11
Tue, 31 Oct
Lecture: Neural network modelling
Lab: SPSS Clementine: Neural network modelling
Readings: [Linoff] chap. 8
Week 12
Tue, 7 Nov
Lecture: Cluster analysis
Lab: SPSS Clementine: Cluster analysis
Readings: [Linoff] chap. 12-13; Handout
Week 13
Tue, 14 Nov
Assignment 2 (machine learning exercise) due
Review & Student Presentations