Slides

The slides of the lectures can be downloaded from here

  • 09-02: an introduction to Big Data and Data Mining
  • 11-02: the effects of 'Big' on Data Analysis
  • 16-02: We have seen that sampling is a means to allow the mining of very large data sets. The mining problem for which we study this in depth in this course is Frequent Pattern Mining; today we introduce this topic (a small sketch of the task is given below the list).
  • 18-02: Now that we know frequent pattern mining, it is time to analyse how we can sample for that task. We start with the sampling mechanism introduced by Toivonen and discuss its pros and cons. In this discussion we see a link between frequent pattern mining and classification. Because of that we introduce the second class of Data Mining problems we look at, viz., classification.
  • 23-02: Last time we saw that there are infinitely large hypothesis classes that cannot be learned. This raises the question: what about finite classes? In this lecture we see that finite classes can always be learned -- provided the set of hypotheses contains the true classifier. Moreover, we compute a minimal sample size for that task (the bound is written out below the list). We then turn this result inside out to arrive at our first definition of PAC learning.
  • 25-02: We know that we can PAC learn finite classes, but how do we do that in concrete cases? We introduce some classes of boolean functions and devise algorithms to learn them (a sketch of one such algorithm is given below the list). Along the way we learn that some finite classes probably cannot be learned efficiently. Then we generalise our definition of PAC learning to arrive at the definition that is best known in the literature.
  • 02-03: We now have a general definition of PAC learning, but what classes of hypotheses can be learned under that definition? We first show that finite classes can still be learned. Then we see that some infinite classes cannot be learned. By contrasting a no-free-lunch theorem with threshold classifiers we come up with the notion of the VC dimension. The expectation is that classes with a finite VC dimension can be learned.
  • 04-03: In this lecture we again compute the VC dimension of some classes of boolean functions, and of polynomials used as classifiers (some standard values are recalled below the list). We see that in some cases it is relatively easy to compute the VC dimension exactly. In others it is far harder and computing a bound is about all we can do.
  • 09-03: In this lecture we prove that our expectation was well-founded: the hypothesis classes with a finite VC dimension are exactly those that can be PAC learned. Moreover, we derive bounds for the sample size, both in terms of the VC dimension and in terms of the growth function (the shape of these bounds is recalled below the list).
  • 11-03: Now that we know the sampling bounds that the fundamental theorem gives us, it is time to return to frequent itemset mining. In this lecture we discuss the papers by Riondato and Upfal in which the authors derive sampling bounds for frequent itemset mining based on PAC learning (a paraphrase of the resulting bound is given below the list).
  • Possible Bonus Topic: PAC learning assumes that the sample size is uniformly the same for all hypotheses. What happens if we loosen that constraint and allow different sample sizes for different hypotheses? Does that allow us to learn more? In this lecture we show that this leads to Structural Risk Minimization, a tower of PAC learnable classes. Moreover, we discuss MDL as a way to do this objectively.
  • Possible Bonus Topic: The other constraint we consider loosening is that of strong learners. What if we are happy with a weak learner, one that is just better than a coin toss? In this lecture we discuss how a procedure called boosting turns weak learners into strong learners; that is, we get back into the realm of PAC learning (a small sketch of such a procedure is given below the list). PAC learning is a rational approach, and using it for bounds for frequent itemset mining is a reasonable choice.
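
For the 16-02 lecture, a minimal sketch of the frequent itemset mining task: given a collection of transactions over a set of items and a minimum support threshold, find every itemset that is contained in at least that many transactions. The brute-force enumeration below is only meant to pin down the definitions (the item names and the threshold are made up for illustration); the lectures discuss far better algorithms.

    from itertools import combinations

    def frequent_itemsets(transactions, min_support):
        """Return every itemset whose support (the number of transactions
        containing it) is at least min_support, by brute-force enumeration."""
        items = sorted(set().union(*transactions))
        result = {}
        for size in range(1, len(items) + 1):
            found_any = False
            for candidate in combinations(items, size):
                cand = frozenset(candidate)
                support = sum(1 for t in transactions if cand <= t)
                if support >= min_support:
                    result[cand] = support
                    found_any = True
            if not found_any:
                # anti-monotonicity: if no itemset of this size is frequent,
                # no larger itemset can be frequent either
                break
        return result

    # toy data, purely for illustration
    transactions = [frozenset(t) for t in
                    [{"bread", "milk"}, {"bread", "beer"},
                     {"bread", "milk", "beer"}, {"milk"}]]
    print(frequent_itemsets(transactions, min_support=2))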
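
For the 23-02 lecture, the bound referred to is the standard one for finite classes in the realizable setting (constants may differ from the slides): to guarantee with probability at least 1 - δ that every hypothesis consistent with the sample has true error at most ε, a sample of size

    m ≥ (1/ε) · ln(|H|/δ)

suffices, where |H| is the number of hypotheses. It follows from the union bound combined with (1 - ε)^m ≤ e^(-εm). Reading this the other way around -- for every ε and δ there is a finite m(ε, δ) that works -- is the "inside out" step that yields the first definition of PAC learning.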
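
For the 25-02 lecture, a minimal sketch of one of the classic algorithms, learning conjunctions of boolean literals in the realizable setting: start from the conjunction of all 2n literals and delete every literal contradicted by a positive example. The result is the most specific consistent conjunction, so negative examples need no separate treatment. The example data below is made up.

    def learn_conjunction(examples, n):
        """Learn a conjunction of boolean literals over n variables from
        labelled examples (x, label), with x a tuple of n booleans."""
        # literal (i, True) stands for "x_i", literal (i, False) for "not x_i"
        literals = {(i, v) for i in range(n) for v in (True, False)}
        for x, label in examples:
            if label:
                # keep only the literals this positive example satisfies
                literals = {(i, v) for (i, v) in literals if x[i] == v}
        def hypothesis(x):
            return all(x[i] == v for (i, v) in literals)
        return hypothesis

    # hypothetical target: x0 and not x2, over n = 3 variables
    examples = [((True, True, False), True),
                ((True, False, False), True),
                ((False, True, False), False),
                ((True, True, True), False)]
    h = learn_conjunction(examples, n=3)
    print(h((True, False, False)))   # True
    print(h((False, False, False)))  # False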
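
For the 02-03 and 04-03 lectures, the flavour of such computations, recalling standard textbook values: threshold classifiers h_a(x) = [x ≥ a] on the real line shatter every single point but no pair, because for two points x1 < x2 the labelling (1, 0) would require x1 ≥ a > x2, contradicting x1 < x2; hence their VC dimension is 1. Likewise, intervals on the line have VC dimension 2 and halfspaces in R^d have VC dimension d + 1.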
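
For the 09-03 lecture, the shape of the VC-based sample bounds (constants omitted, as they differ between textbooks): in the realizable case a sample of size on the order of

    m = O( (d · ln(1/ε) + ln(1/δ)) / ε )

suffices, and in the agnostic case

    m = O( (d + ln(1/δ)) / ε² ),

where d is the VC dimension of the hypothesis class; matching lower bounds show that the linear dependence on d cannot be avoided.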
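
For the 11-03 lecture, a paraphrase of the Riondato-Upfal approach (from memory, so consult the papers for the precise statements and constants): bound the VC dimension of the range space induced by the itemsets of the dataset by some quantity d -- the authors use what they call the d-index -- and draw a sample of size on the order of

    m = O( (d + ln(1/δ)) / ε² ).

With probability at least 1 - δ, every itemset's frequency in the sample is then within ε of its frequency in the full dataset, so mining the sample at a threshold lowered by ε misses no itemset that is frequent in the full dataset.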
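
For the second bonus topic, a minimal sketch of the best-known boosting procedure, AdaBoost; the use of scikit-learn decision stumps as weak learners and labels in {-1, +1} are illustrative assumptions, not part of the lecture material.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, rounds=20):
        """AdaBoost with decision stumps as weak learners; y holds labels
        in {-1, +1}. Returns (weight, stump) pairs defining the strong
        classifier."""
        n = len(y)
        w = np.full(n, 1.0 / n)                    # uniform example weights
        ensemble = []
        for _ in range(rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = max(np.sum(w[pred != y]), 1e-10)  # weighted training error
            if err >= 0.5:                          # not even a weak learner
                break
            alpha = 0.5 * np.log((1 - err) / err)
            w *= np.exp(-alpha * y * pred)          # up-weight the mistakes
            w /= w.sum()
            ensemble.append((alpha, stump))
        return ensemble

    def predict(ensemble, X):
        score = sum(a * stump.predict(X) for a, stump in ensemble)
        return np.sign(score)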