Exam Data Mining
January 11, 2023, 17.00-19.30 hrs

Question 1 MIXED SHORT QUESTIONS

Part 1 (Bagging and Random Forests)
(a) FALSE (the order of the examples doesn't matter)
(b) FALSE (features are sampled per split, not per tree)
(c) TRUE
(d) TRUE

Part 2 (Frequent Pattern Mining)
(a) TRUE
(b) FALSE (3 times)
(c) FALSE (candidates are generated by rightmost extension)
(d) TRUE

Part 3 (Classification Trees)
If we predict class c, the probability that the prediction is wrong is 1 - p(c|t). Since we predict class c with probability p(c|t), the overall probability of making an error is
\sum_{c=1}^{C} p(c|t)(1 - p(c|t)),
which is the formula for the gini-index.

Part 4 (Graphical Models)
The correct answer is (b). Model (b) is the independence model, and x_1 and x_2 are exactly independent in the data, so the independence model gives a perfect fit of the observed counts. The saturated model (c) fits equally well, but uses one more parameter, so it has a worse BIC score. Model (a) fits the uniform table of counts 25, 25, 25, 25. It has two fewer parameters than the independence model, but a much worse fit.

Part 5 (Logistic Regression)
The correct answer is (d). The odds are multiplied by exp(0.14), which is approximately 1.15. Multiplication by 1.15 is the same as an increase of 15%.

Question 2 CLASSIFICATION TREES

(a) i(t1) = 1/2 * 1/2 = 1/4, and i(t2) = i(t3) = 1/5 * 4/5 = 4/25. The impurity reduction is
1/4 - (1/2 * 4/25 + 1/2 * 4/25) = 9/100.

(b) The smallest minimizing subtree (SMS) for a1 = 0 is obtained by pruning in t2. Next we compute g(t1) = 21/100 and g(t3) = 2/100. We prune in t3 and set a2 = 2/100. Next we recompute g(t1) = 3/10, and set a3 = 3/10. Summarizing: T1 is obtained by pruning in t2, and it is the SMS for a in [0, 2/100). T2 is obtained by pruning T1 in node t3, and it is the SMS for a in [2/100, 3/10). The root node is the SMS for a >= 3/10.

(c) sqrt(2/100 * 3/10) = 0.077 (approximately), the geometric mean of the interval endpoints 2/100 and 3/10.

Question 3 CLOSED FREQUENT ITEM SET MINING

(a)
LEVEL 1:
      sup  gen?
A     2    v
B     4    v
C     5    v
D     3    v
E     1    x

LEVEL 2:
      sup  gen?
AB    2    x
AC    2    x
AD    0    x
BC    4    x
BD    2    v
CD    3    x

E and AD are pruned due to insufficient support; all other itemsets are pruned because they have a subset with the same support.

(b)
gen   closure   sup
A     ABC       2
B     BC        4
C     C         5
D     CD        3
BD    BCD       2

Question 4 BAYESIAN NETWORKS

(a) Yes, the resulting model has the same skeleton and the same v-structures.

(b) The current score of lipo is -1208. After deleting the edge mental --> lipo, lipo has one parent left, which is smoke. The score of lipo then becomes
598 log(598/961) + 363 log(363/961) + 463 log(463/880) + 417 log(417/880) = -1245.855,
which we round to -1246. The change in log-likelihood score is -1246 + 1208 = -38, so a decrease of 38.

(c) The penalty per parameter is log(1841)/2 = 3.76. Deleting the edge mental --> lipo reduces the number of parameters by 2, so the change in BIC score is -38 + 2 * 3.76 = -30.48, or -30 after rounding. The BIC score decreases by 30.

Question 5 MULTINOMIAL NAIVE BAYES

(a) |V| = 10
P(good|Pos) = (2+1)/(8+10) = 1/6
P(good|Neg) = (0+1)/(8+10) = 1/18
P(teacher|Pos) = (1+1)/(8+10) = 1/9
P(teacher|Neg) = (2+1)/(8+10) = 1/6

(b) The class priors are P(Pos) = P(Neg) = 1/2. We ignore "very" in the test document, because it did not occur in the training set.
P(Pos)P(good teacher|Pos) = 1/2 * 1/6 * 1/9 = 1/108
P(Neg)P(good teacher|Neg) = 1/2 * 1/18 * 1/6 = 1/216
P(Pos|good teacher) = (1/108) / (1/108 + 1/216) = 2/3

(c) One training document of class A containing the word "a", and one training document of class B containing the word "b". The test document contains the words "a" and "b".
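
Numerical check for Question 2. The short Python sketch below recomputes the gini impurities, the impurity reduction, and the geometric mean of the two a-values; the node proportions are taken directly from the answer above.

from math import sqrt

def gini(p):
    # gini index of a two-class node with class proportions p and 1 - p
    return p * (1 - p)

i_t1 = gini(1/2)                  # 0.25 = 1/4
i_t2 = i_t3 = gini(1/5)           # 0.16 = 4/25
# each child node receives half of the cases
reduction = i_t1 - (1/2 * i_t2 + 1/2 * i_t3)
print(reduction)                  # 0.09 = 9/100

# value of a used for T2: geometric mean of its interval endpoints
print(sqrt(2/100 * 3/10))         # approximately 0.077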
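
Check for Question 3. The exam's transaction database is not reproduced in this answer key, so the sketch below uses a hypothetical database chosen to be consistent with all supports listed in part (a); it recomputes the supports and the closures of the generators (the closure of a generator is the intersection of the transactions that contain it).

# hypothetical transactions, consistent with the supports in part (a)
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"B", "C", "D"},
    {"B", "C", "D"},
    {"C", "D", "E"},
]

def support(itemset):
    # number of transactions that contain the itemset
    return sum(itemset <= t for t in transactions)

def closure(itemset):
    # intersection of all transactions containing the itemset
    covers = [t for t in transactions if itemset <= t]
    return set.intersection(*covers)

for gen in [{"A"}, {"B"}, {"C"}, {"D"}, {"B", "D"}]:
    print(sorted(gen), "->", sorted(closure(gen)), "support", support(gen))

This reproduces the table in part (b): A -> ABC (2), B -> BC (4), C -> C (5), D -> CD (3), BD -> BCD (2).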
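
Numerical check for Question 4(b) and (c). The sketch below recomputes the log-likelihood score of lipo with smoke as its only parent, the change with respect to the current score of -1208, and the BIC penalty (natural logarithm, n = 1841, as in the answer above).

from math import log

# counts from the answer: (count, total of its parent configuration)
counts = [(598, 961), (363, 961), (463, 880), (417, 880)]
score = sum(n * log(n / total) for n, total in counts)
print(round(score))                     # -1246

delta_loglik = round(score) - (-1208)   # change in log-likelihood
print(delta_loglik)                     # -38

penalty = log(1841) / 2                 # BIC penalty per parameter, about 3.76
delta_bic = delta_loglik + 2 * penalty  # two parameters fewer after deletion
print(delta_bic)                        # about -30.48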
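
Numerical check for Question 5. The sketch below recomputes the Laplace-smoothed word probabilities and the posterior P(Pos | good teacher), using the counts from the answer above (8 word tokens per class in the training data, vocabulary size 10).

from fractions import Fraction as F

V = 10          # vocabulary size
tokens = 8      # word tokens per class in the training data

def smoothed(count):
    # Laplace (add-one) smoothed estimate of P(word | class)
    return F(count + 1, tokens + V)

p_good_pos, p_good_neg = smoothed(2), smoothed(0)        # 1/6, 1/18
p_teacher_pos, p_teacher_neg = smoothed(1), smoothed(2)  # 1/9, 1/6

prior = F(1, 2)
joint_pos = prior * p_good_pos * p_teacher_pos           # 1/108
joint_neg = prior * p_good_neg * p_teacher_neg           # 1/216
print(joint_pos / (joint_pos + joint_neg))               # 2/3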