GETTING STARTED WITH THE ASSIGNMENT IN Python

We start with reading the credit scoring data from the lectures in Python. The data can be found here. Suppose you saved the data to a file "credit.txt" in the directory "dm" on the C drive. To read it into Python type (">>>" denotes the prompt): >>> import numpy as np >>> credit_data = np.genfromtxt('C:/dm/credit.txt', delimiter=',', skip_header=True) To display its value, just type its name at the command line: >>> credit_data array([[22., 0., 0., 28., 1., 0.], [46., 0., 1., 32., 0., 0.], [24., 1., 1., 24., 1., 0.], [25., 0., 0., 27., 1., 0.], [29., 1., 1., 32., 0., 0.], [45., 1., 1., 30., 0., 1.], [63., 1., 1., 58., 1., 1.], [36., 1., 0., 52., 1., 1.], [23., 0., 1., 40., 0., 1.], [50., 1., 1., 28., 0., 1.]]) "credit_data" is now a 2d NumPy array. Each rows represent a record and the columns represent the data attributes. Select the first row of credit_data: >>> credit_data[0] array([22., 0., 0., 28., 1., 0.]) Select the fourth column of credit_data: >>> credit_data[:,3] array([28., 32., 24., 27., 32., 30., 58., 52., 40., 28.]) Select the element in row 4, column 0: >>> credit_data[4,0] 29.0 Give the distinct values of income, sorted from low to high: >>> np.sort(np.unique(credit_data[:,3])) array([24., 27., 28., 30., 32., 40., 52., 58.]) Add all the entries of the sixth column: >>> np.sum(credit_data[:,5]) 5.0 Add the entries of each column of credit_data: >>> credit_data.sum(axis=0) array([363., 6., 7., 351., 5., 5.]) Add the entries of each row: >>> credit_data.sum(axis=1) array([ 51., 79., 51., 53., 63., 78., 125., 91., 65., 81.]) Select all rows where the first column is bigger than 27: >>> credit_data[credit_data[:,0] > 27] array([[46., 0., 1., 32., 0., 0.], [29., 1., 1., 32., 0., 0.], [45., 1., 1., 30., 0., 1.], [63., 1., 1., 58., 1., 1.], [36., 1., 0., 52., 1., 1.], [50., 1., 1., 28., 0., 1.]]) Construct a vector "x" with the numbers 2, 5, 10 in that order: >>> x = np.array([2, 5, 10]) >>> x array([ 2, 5, 10]) Construct a vector consisting of the numbers 0 through 9: >>> np.arange(0, 10) array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) Select the *row numbers* of the rows where the first column of credit_data is bigger than 27: >>> np.arange(0, 10)[credit_data[:,0] > 27] array([1, 4, 5, 6, 7, 9]) Draw a random sample of size 5 from the numbers 1 through 10 (without replacement): >>> index = np.random.choice(np.arange(0, 10), size=5, replace=False) >>> index array([5, 7, 1, 3, 8]) Select the corresponding rows: >>> train = credit_data[index,] >>> train array([[45., 1., 1., 30., 0., 1.], [36., 1., 0., 52., 1., 1.], [46., 0., 1., 32., 0., 0.], [25., 0., 0., 27., 1., 0.], [23., 0., 1., 40., 0., 1.]]) Select all rows with row number not in "index": (This does not delete any rows from the original credit_data.) >>> test = np.delete(credit_data, index, axis=0) >>> test array([[22., 0., 0., 28., 1., 0.], [24., 1., 1., 24., 1., 0.], [29., 1., 1., 32., 0., 0.], [63., 1., 1., 58., 1., 1.], [50., 1., 1., 28., 0., 1.]]) Consult the help page of the function "np.random.choice" >>> help(np.random.choice)

Practice exercise 1

Assume we have a classification problem with only 2 classes that are labeled 0 and 1 respectively. Write a function that computes the impurity of a vector (of arbitrary length) of class labels. Use the gini-index as impurity measure. Do not use a loop structure in your function, this is not necessary. Example: >>> array=np.array([1,0,1,1,1,0,0,1,1,0,1]) >>> array array([1,0,1,1,1,0,0,1,1,0,1]) >>> impurity(array) 0.23140495867768596

Practice exercise 2

Write a function "bestsplit(x,y)" that computes the best split value on a numeric attribute x. Here x is a vector of numeric values, and y is the vector of class labels (assume there are only two classes, coded as 0 and 1). x and y must be of the same length: y[i] is the class label of the i-th observation, and x[i] is the corresponding value of attribute x. Only consider splits of type "x <= c" where "c" is the average of two consecutive values of x in the sorted order. So one child contains all elements with "x <= c" and the other child contains all elements with "x > c". The best split is the split that achieves the highest impurity reduction. Example (best split on income): >>> bestsplit(credit_data[:,3],credit_data[:,5]) 36 Hint: Clever use of "subscripting" (selecting elements of vectors and matrices) is important. For example, y[x > 29] produces a vector with all elements of y whose corresponding x-element (that is the element of x with the same index) is bigger than 29. More formally: y[x > 29] = {y[i]: x[i] > 29}. The result is a vector, not a set, i.e. duplicate values may occur. Just try it! Hint: Example of how to determine candidate split points >>> income_sorted = np.sort(np.unique(credit_data[:,3])) >>> income_sorted array([24, 27, 28, 30, 32, 40, 52, 58]) >>> income_splitpoints = (income_sorted[0:7]+income_sorted[1:8])/2 >>> income_splitpoints array([25.5, 27.5, 29. , 31. , 36. , 46. , 55. ]) Note: use the "brute force" approach, i.e. don't implement the "segment borders" algorithm.