Syllabus

K6312: INFORMATION MINING & ANALYSIS
Academic Year 2012-2013, Semester 1

Course Description

This course examines the main techniques used for extracting information from large amounts of numerical data, as well as the broader process of knowledge discovery. The information extracted in data mining is often in the form of associations, patterns or specific facts hidden in the mass of data. The extracted information can be synthesized to create new knowledge, or used to construct prediction models or complex knowledge structures. The information/knowledge extracted should be useful in a tangible way to increase the profitability, efficiency, effectiveness or competitive advantage of an organization.

Data mining techniques can be divided into three types:

Statistical techniques, including hypothesis testing, the Chi-square test, the t-test, multiple regression analysis and logistic regression;
Machine learning, including clustering, K-nearest neighbour, decision-tree induction, association rule induction, Bayesian learning and neural networks;
Natural language processing, including lexical, syntactic and semantic analysis.

Application of data mining in customer relationship management will be examined in the course.

Unfortunately, in this course there is not enough time to look into text mining and natural language processing. A separate course on this topic is being planned. Time series analysis, data warehousing and On-Line Analytical Processing (OLAP) will also not be covered this semester.

The class will be run partly in a project-based/problem-based style. Lab sessions and the class project will give students hands-on experience with data mining software. The course takes a how-to-do-it, how-does-it-work and how-to-apply-it approach, with a minimum of mathematics. It seeks to develop the student’s commonsense ability to examine data from different angles and to apply data mining techniques intelligently to tease out useful patterns in the data.

Objectives

At the end of the course, students are expected to:

Understand the principles and concepts underlying the main data mining techniques, and their strengths and limitations;
Apply data mining techniques and the knowledge discovery process to discover hidden information in numerical data;
Understand the different kinds of patterns and models that can be extracted from a data set, and be able to select and use an appropriate technique for each type of pattern and model;
Be able to interpret and evaluate the results of data mining;
Describe how data mining can be used in real-life applications;
Know the main features and functionalities that a good data mining and text mining tool should have.

Prerequisites

Students must have taken at least one semester of statistical analysis at the undergraduate level, and be comfortable analyzing data using a spreadsheet programme.

This course does not require a lot of reading. However, students must be prepared to spend sufficient time thinking and grappling with new data analysis concepts, and analyzing datasets.

The data analysis software used in the course (SPSS Statistical Software & SPSS Clementine) is available only in the school labs and is not available for student loan. Students must be prepared to come to school and work in the lab to complete the assignments.

Attendance is compulsory in the first half of the semester (up to the mid-semester break). You’re likely to do poorly in the final exam if you miss even a single class.

Method of Assessment

Grading

Class attendance, participation, homework: 5% of final grade
Group presentation: 5%
Mid-term test: 10%
Assignment 1 (statistical analysis of a dataset): 15%
Assignment 2 (analysis of a dataset using machine-learning techniques): 15%
Final exam (closed book): 50%

Assignment 1. Statistical analysis exercise

Carry out statistical analyses on a data set to be provided by the instructor.
Do a group presentation for either Assignment 1 or Assignment 2.
Write and submit a report of 8-10 pages, excluding appendices.

Assignment 2. Analysis using machine-learning

Analyze the same data set using machine learning techniques and compare the results with the results from the statistical analyses.
Do a group presentation, if you haven’t done so for Assignment 1.
Write and submit a report of 8-10 pages, excluding appendices.

Readings

Recommended Texts

[Code] Book title

[Linoff] Linoff, G.S., & Berry, M.J.A. (2011). Data mining techniques for marketing, sales, and customer relationship management (3rd ed.). Indianapolis: Wiley. ISBN 978-0-470-65093-6.

[SPSS] Carver, R.H., & Nash, J.G. (latest edition). Doing data analysis with SPSS. Belmont, CA: Brooks/Cole.

Other useful texts

[Tan] Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Boston: Pearson/Addison Wesley. ISBN 0-321-32136-7.

[Witten] Witten, I.H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). Amsterdam: Elsevier.

[Westphal] Westphal, C.R. (1998). Data mining solutions: Methods and tools for solving real-world problems. New York: Wiley.

[Manning] Manning, C.D., & Schütze, H. (2000). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.

[Mitchell] Mitchell, T.M. (1997). Machine learning. New York: McGraw-Hill.

[Patterson] Patterson, D.W. (1996). Artificial neural networks: Theory and applications. Singapore: Prentice Hall.

[Nisbet] Nisbet, R., Elder, J., & Miner, G. (2009). Handbook of statistical analysis and data mining applications. Amsterdam: Elsevier. ISBN 978-0-12-374765-5.

[Tabachnick] Tabachnick, B.G., & Fidell, L.S. (2001). Using multivariate statistics (4th ed.). Boston: Allyn and Bacon.

[Lattin] Lattin, J.M., Carroll, J.D., & Green, P.E. (2003). Analyzing multivariate data. Pacific Grove, CA: Brooks/Cole – Thomson Learning.

Lecture Schedule

Week 1

Tue, 15 Aug

Lecture: Introduction
Data cleaning & preparation
Introduction to the survey data set for Assignments 1 & 2

Lab: Basic data analysis with MS Excel

Reading: [Linoff] chap. 1 & 2

Week 2

Tue, 22 Aug

Lecture: Basic statistical analysis

Lab: Basic statistical analysis with SPSS

Readings: [Linoff] chap. 4; [SPSS] chap. 8-12

Week 3

Tue, 29 Aug

Lecture: Basic statistical analysis, part 2

Lab: SPSS, part 2

Readings: [Linoff] chap. 3; [SPSS] chap. 20

Week 4

Tue, 5 Sep

No class – Chris is away attending the ISIC2012 conference in Tokyo

Readings: [Linoff] chap. 5

Week 5

Tue, 12 Sep

Lecture: Regression analysis

Lab: SPSS: Regression analysis

Readings: [Linoff] chap. 6; [SPSS] chap. 15-17

Handout

Week 6

Tue, 19 Sep

Lecture: Logistic regression

Lab: SPSS: Cross-tabulation, Chi-square test of independence, Logistic regression

Readings: [Linoff] chap. 18-19

Handout

Week 7

Tue, 26 Sep

Lecture: Principal components analysis

Lab: SPSS: Principal components analysis

Readings: [Linoff] chap. 20

Handout

Recess

Tue, 3 Oct

Make-up Class

Lecture: Association rule mining

Lab: Introduction to data mining software SPSS Clementine

Lab: SPSS Clementine: Association rule mining

Readings: [Linoff] chap. 15

Week 8

Tue, 10 Oct

Mid-term test

Assignment 1 (statistical analysis) due

Week 9

Tue, 17 Oct

Lecture: Decision tree induction

Lab: SPSS Clementine: Decision tree induction

Readings: [Linoff] chap. 7

Week 10

Tue, 24 Oct

Lecture: Evaluation of classifiers

Lab: SPSS Clementine: Evaluation & comparison of models

Readings: [Linoff] chap. 2 (pp. 47-54), chap. 5 step 8

Week 11

Tue, 31 Oct

Lecture: Neural network modelling

Lab: SPSS Clementine: Neural network modelling

Readings: [Linoff] chap. 8

Week 12

Tue, 7 Nov

Lecture: Cluster analysis

Lab: SPSS Clementine: Cluster analysis

Readings: [Linoff] chap. 12-13

Handout

Week 13

Tue, 14 Nov

Assignment 2 (machine learning exercise) due

Review & Student Presentations
