Question 1

P(y=0|x=A) = 0
P(y=0|x=B) = 0.75
P(y=0|x=C) = 1
P(y=0|x=D) = 0.33
Sorted by these values: A, D, B, C

(a) x in {A}, x in {A,D}, x in {A,D,B}

(b) Impurity of the node: 0.5 * 0.5 = 0.25
Impurity of x in {A,D}: 0.2 * 0.8 = 0.16
Impurity of x in {B,C}: 0.8 * 0.2 = 0.16
Proportion with x in {A,D}: 0.5
Proportion with x in {B,C}: 0.5
Impurity reduction: 0.25 - (0.5*0.16 + 0.5*0.16) = 0.09
Twice this number was also accepted as correct.
(Short Python sketches of this and several of the computations below are appended after Question 10.)

Question 2

(a) Prune in t_2.

(b) g(t_1) = (0.5 - 0.1)/(3 - 1) = 0.2
g(t_3) = (0.1 - 0)/(2 - 1) = 0.1
So a_2 = 0.1.

(c) The resubstitution error in the root node is at most 0.5 for binary classification problems, so the worst-case total cost of the root is 0.5 + a. The best case for a tree other than the root node alone is a tree with a single split and zero resubstitution error; its total cost is 0 + 2a. For a = 0.5 both trees have the same total cost, and the root node is the smallest minimizing subtree (and it of course remains the smallest minimizing subtree for a > 0.5).

Question 3

(a) n(A,B,C) n(C,D) n(C,E) / n(C)^2

(b) 12

Question 4

add(A --> C), add(D --> C)

Question 5

(a) The independence model gives a perfect fit of the observed counts, so: 32, 48, 8, 12.

(b) Since the independence model gives a perfect fit of the observed counts, it has the same loglikelihood score as the saturated model x --> y. But the latter uses one more parameter to achieve that fit, at a cost of log(100)/2 ≈ 2.3. So the difference in BIC score is -2.3.

Question 6

|V| = 11, and the number of words in the positive reviews is 12, so
P(beautiful|Positive) = (2+1)/(12+11) = 3/23.

Question 7

ACD, BDE, ABD

Question 8

(a) BC, BD, CD

(b) ABC, BC, BCD

Question 9

Author pairs that did not write a paper together in 2021, but did in 2022: AD and BD.

Question 10

Let xyz denote an arbitrary frequent sequence of length k+1, where x is a sequence of length k-1 (possibly the empty sequence if k = 1), and y and z are symbols from the set of labels. By the anti-monotonicity property of support, every subsequence of xyz is frequent. In particular, xy and xz are both length-k subsequences of xyz, and therefore frequent. The GSP candidate generation rule combines any two frequent sequences A and B of length k that match on the first k-1 positions into the length-(k+1) candidate A[1:k] + B[k]. Since xy and xz match on the first k-1 positions, they are combined into the candidate xyz.
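
The sketches below use Python; all numbers are taken from the answers above, and function and variable names are chosen here purely for illustration. First, the Gini impurity reduction from Question 1(b):

    def gini(p):
        # Gini index p*(1-p) used in the answer; the doubled form 2*p*(1-p) was also accepted
        return p * (1 - p)

    parent = gini(0.5)                          # impurity of the node: 0.25
    left, right = gini(0.2), gini(0.8)          # x in {A,D} and x in {B,C}: both 0.16
    print(parent - (0.5 * left + 0.5 * right))  # 0.09 (up to floating-point rounding)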
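
The g(t) values from Question 2(b), using g(t) = (R(t) - R(T_t)) / (|leaves(T_t)| - 1); only the numbers from the answer are used, the tree itself is not reproduced here:

    def g(r_node, r_subtree, n_leaves):
        # link strength g(t) = (R(t) - R(T_t)) / (number of leaves of T_t - 1)
        return (r_node - r_subtree) / (n_leaves - 1)

    print(g(0.5, 0.1, 3))   # g(t_1) = 0.2
    print(g(0.1, 0.0, 2))   # g(t_3) = 0.1, hence a_2 = 0.1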
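
The BIC penalty used in Question 5(b), assuming the natural logarithm and n = 100 observations:

    import math

    # one extra parameter costs log(n)/2 in the BIC score
    print(math.log(100) / 2)   # about 2.30, so the BIC difference is about -2.3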
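
The add-one (Laplace) smoothed estimate from Question 6:

    from fractions import Fraction

    def add_one_estimate(word_count, total_words, vocab_size):
        # smoothed P(word | class) = (count + 1) / (words in class + |V|)
        return Fraction(word_count + 1, total_words + vocab_size)

    # 'beautiful' occurs 2 times among the 12 words of the positive reviews, |V| = 11
    print(add_one_estimate(2, 12, 11))   # 3/23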
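
Finally, the GSP join rule as stated in Question 10, with hypothetical symbols standing in for x, y and z:

    def gsp_join(a, b):
        # join rule from the answer: two length-k sequences that match on the first
        # k-1 positions are combined into the length-(k+1) candidate A[1:k] + B[k]
        if len(a) == len(b) and a[:-1] == b[:-1]:
            return a + (b[-1],)
        return None

    # take x = ('p', 'q'), y = 'y', z = 'z', so k = 3
    xy = ('p', 'q', 'y')
    xz = ('p', 'q', 'z')
    print(gsp_join(xy, xz))   # ('p', 'q', 'y', 'z'), i.e. the candidate xyz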