This blog is for the course K6225 Knowledge Discovery & Data Mining in the MSc Knowledge Management programme offered in the School of Communication & Information. The blog was set up for 3 purposes:
- pedagogical purpose: to be an additional channel for the instructor to provide general advice/observations on each week’s topic, as well as to answer specific questions posted by students on the course Twitter site.
- student recruitment: to give potential students a glimpse or taste of what goes on in the class and what students learn
- experiment with social media: to explore how social media can be used to enhance student learning.
Students learn how to construct a prediction model incrementally using stepwise linear regression. They also learn:
- about data preparation, and converting categorical variables into a set of dummy variables with numerical values of 0 and 1
- the concept of interaction: 2 factors interact if the effect of 1 factor on the prediction depends on the value of the other factor
- the concept of partial F-tests — to check that each factor in the model contributes significantly to the accuracy of the model, over and above the rest of the factors in the model
- how to interpret the output from the SPSS statistical software.
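The ideas in this list can also be sketched outside SPSS. What follows is a minimal illustration in plain Python, not the course's method: the function names (`dummy_code`, `interaction`, `partial_f`), the tiny dataset and the two error sums are all made up for the example. The partial F statistic uses the usual textbook form, F = [(SSE_reduced - SSE_full) / q] / [SSE_full / (n - p - 1)], where q is the number of factors added and p is the number of predictors in the full model.

```python
def dummy_code(values, reference):
    """Turn one categorical column into 0/1 dummy columns.

    The reference level gets no column of its own: a row belongs to the
    reference level when all of its dummies are 0.
    """
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


def interaction(col_a, col_b):
    """Interaction term: the element-wise product of two predictor columns."""
    return [a * b for a, b in zip(col_a, col_b)]


def partial_f(sse_reduced, sse_full, n_added, n_obs, n_predictors_full):
    """Partial F statistic for the predictors added to a reduced model."""
    df_error = n_obs - n_predictors_full - 1  # residual df of the full model
    return ((sse_reduced - sse_full) / n_added) / (sse_full / df_error)


# Made-up illustrative data, not taken from the course dataset.
library_type = ["academic", "public", "special", "public", "academic"]
years = [2, 10, 5, 7, 12]

dummies = dummy_code(library_type, reference="academic")
print(dummies["public"])                      # [0, 1, 0, 1, 0]
print(interaction(dummies["public"], years))  # [0, 10, 0, 7, 0]

# Hypothetical sums of squared errors from two nested regression models.
print(partial_f(sse_reduced=120.0, sse_full=90.0,
                n_added=2, n_obs=50, n_predictors_full=5))
```

A large partial F (compared against the F distribution with q and n - p - 1 degrees of freedom) suggests the added factors improve the model over and above the factors already in it. SPSS reports the corresponding p-value directly.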
Students start work on assignment 1. They start out exploring the data visually and using univariate analysis. They wonder what the purpose of univariate analysis is!
In the lab session, students are introduced to the SPSS statistical software. This is used to carry out bivariate analysis:
- between a quantitative variable (salary) and a categorical variable (e.g. type of library) using a t-test
- between two quantitative variables (e.g. salary vs. age) using Pearson's r (correlation coefficient)
- between two categorical variables (type of library vs. qualification) using a cross-tab and the Chi-square test of independence.
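For anyone curious what the software computes behind those three menu options, here is a rough sketch in plain Python. It is only an illustration under stated assumptions: the numbers are invented, the function names are hypothetical, and the t statistic uses Welch's unequal-variance form, which may differ from the default in the SPSS output. The sketch returns the test statistics only; SPSS also gives the p-values.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """t statistic comparing the means of two groups (Welch's form)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def pearson_r(x, y):
    """Pearson correlation coefficient between two quantitative variables."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up numbers for illustration only (not the salary survey data).
salary_academic = [50, 52, 54]  # salaries in thousands
salary_public = [40, 42, 44]
age = [25, 32, 41, 50]
salary = [38, 45, 51, 60]
crosstab = [[20, 10],  # rows: type of library
            [10, 20]]  # columns: qualification

print(welch_t(salary_academic, salary_public))
print(pearson_r(age, salary))
print(chi_square(crosstab))
```

Note that a t-test compares two groups at a time, so a categorical variable with more than two categories needs either pairwise comparisons or a different test.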
The lecture session reviewed 3 important statistical concepts that students should have some familiarity with from their undergraduate stats class:
- normal distribution, standard deviation and standardised scores
- sampling distribution and interval estimation
- hypothesis testing and p-value
Students looked like they were having a headache at the end of class.
Some international students asked me whether they needed to do any calculation by hand! I told them they just need to know which icon to click in the software and how to interpret the output! Welcome to the 21st century!
I also told students they don’t have to memorise anything for the exam, except that, for a normal distribution, about 68% of the population lies within 1 SD of the mean, about 95% within 2 SD and about 99.7% within 3 SD. Not sure what they make of this!
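Those three percentages need not be taken on faith. As a small sketch (the function name `share_within` is just a label chosen here), the proportion of a normal distribution lying within k standard deviations of the mean is erf(k / sqrt(2)), which Python's standard library can evaluate directly:

```python
from math import erf, sqrt


def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```

So the 3 SD figure is closer to 99.7% than to 99%, and all three numbers apply only when the data are roughly normally distributed.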
The next class will feature an “American Idol” competition — data mining version!
The first few labs will make use of Salary Survey data from the Library Association to learn basic statistical analysis. In the first lab, students make use of MS Excel to explore bivariate relations — between annual income (target variable) and independent variables such as age, years in the profession, type of library, type of job roles, qualification, etc. The Pivot Table function is used for exploratory data analysis. This is similar to cross tabulation, and can be seen as a simple form of OLAP (Online Analytical Processing). Students also learn about preparing data for analysis.
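To make the Pivot Table idea concrete, here is a minimal sketch of the same operation in plain Python: group the records by two categorical variables and average the target variable in each cell. The records and the helper name `pivot_mean` are invented for illustration and are not the survey data.

```python
from collections import defaultdict
from statistics import mean

# Made-up records for illustration only (not the salary survey data).
records = [
    {"library": "academic", "qualification": "MSc", "income": 60000},
    {"library": "academic", "qualification": "MSc", "income": 64000},
    {"library": "academic", "qualification": "BA", "income": 48000},
    {"library": "public", "qualification": "MSc", "income": 52000},
    {"library": "public", "qualification": "BA", "income": 41000},
    {"library": "public", "qualification": "BA", "income": 43000},
]


def pivot_mean(rows, row_key, col_key, value_key):
    """Average value_key for every (row_key, col_key) combination."""
    cells = defaultdict(list)
    for r in rows:
        cells[(r[row_key], r[col_key])].append(r[value_key])
    return {cell: mean(values) for cell, values in cells.items()}


table = pivot_mean(records, "library", "qualification", "income")
for (library, qualification), avg in sorted(table.items()):
    print(f"{library:10} {qualification:4} {avg:>8}")
```

Swapping which variables play the row and column roles gives a different view of the same data, which is the "slicing" idea behind OLAP.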
Students will tell you that this is a difficult but useful course — possibly the most difficult course in the KM programme. I think students find the course difficult for 2 reasons:
- Other courses in the KM programme are management oriented, so students are not used to the hands-on, practical skills nature of the course. We have continued to maintain the course in the KM programme because faculty think it is important for students to have some practical skills — to be more attractive to employers.
- Students are mentally lazy at the end of the day, and don’t want to think very hard! (Sorry!)
This course does not require a lot of reading. However, there are many weird, counter-intuitive concepts to learn, and new ways of thinking about data. Students have to mentally wrestle with new data analysis concepts week after week, until they have a headache. Students tell me that, often, they think they understand a concept when I explain it in class, but find that they can’t remember or understand it later. That’s the nature of the subject. One has to grapple with each concept 4 or 5 times before it becomes familiar and commonsensical! My advice to students is to review lecture material right after class (on the train home), and again on the way to class the following week. I usually spend the first half hour of each class reviewing the previous week’s material.
Students also complain that they feel lost in the first half of the semester (when they’re learning statistical analysis). That is expected — “no pain, no gain”! I’ve scheduled a mid-term so that students can consolidate what they’ve learnt and have a sense of attainment after the mid-term. Most students do get there and become competent in data analysis and pass the course — but they do complain a lot along the way!
Actually, such is the life of professional data miners. To be successful in data mining you have to immerse yourself in the data, and wrestle with the data all the time — to find that useful pattern or business idea that will help your organisation.
Prerequisite for the course: Students must have taken at least 1 semester of statistical analysis at the undergraduate level, and be comfortable analysing data using a spreadsheet programme. (This is on the advice of previous students, so that the concepts covered in the first few weeks are at least faintly familiar to you.)
Read [Linoff & Berry] chap. 1 & 2. Reading is like text mining — you should not memorise every line. Skim the text to look for “good stuff”, i.e. read purposefully. But what is the purpose of reading [Linoff & Berry]? To help you do the Homework!
Homework: Discussion board posting: Post a 1-page (under 1000 words) discussion of one way in which data mining can be applied in your current (or past or future) organisation. Your discussion should cover the following stages of the data mining cycle as outlined in the reading:
- Identify business opportunities
- Mine data to transform data into actionable information
- Act on the information
- Measure the results
As I will explain in class, the purpose of this homework is not to make your life difficult, but to help you get a job! When you go for your job interview, you’ll be asked how you think data mining can be applied in the company (in other words, why hire you?). You should be prepared to outline various ways in which data mining can be useful, as well as discuss in detail 1 particularly promising way.
There are two data mining courses offered in the School:
- a technical course offered in the MSc Information Systems programme (CI6227 Data Mining)
- a practical course (this course, K6225) using a how-to-do-it, how-does-it-work and how-to-apply-it kind of approach, with a minimum of mathematics.
This course seeks to develop the student’s commonsense ability to manipulate data from different angles. When I first taught this course more than 10 years ago, I focused on teaching methods and techniques, expecting students to be able to use commonsense to apply them. I was horrified at the end of the semester to find in the term reports and exam answers that students had many misconceptions and were applying the methods incorrectly. I gradually learnt that “commonsense” is actually uncommon, and that data analysis is an art, requiring knowledge, skill and creativity. The course now adopts a more problem-based approach where students analyse a particular dataset throughout the semester, each week applying the technique they have learnt in class. Every week, 2 or 3 groups of students give a 3-minute presentation of their data analysis results, so that I can point out misconceptions, how the analysis can be improved, and subtleties not highlighted in the lecture material.
This semester, I’m experimenting with social media to supplement classroom interaction. A blog and a Twitter account will be set up for students to send comments, questions and reflections. This is in addition to the discussion forum in EdveNTUre.
The first half of the semester is devoted to statistical analysis, and the second half to machine-learning methods. This is because there is no separate statistics course in the KM programme, and I think it is dangerous to go into an organisation to do data mining without knowing basic statistical analysis.
PhD students have found this a good substitute for a stats course. I must caution students, though, that the course doesn’t cover experimental design and Analysis of Variance, which PhD students doing quantitative research should know. (Courses on ANOVA are available in the Psychology Division.)