GETTING STARTED WITH THE ASSIGNMENT IN R

We start by reading the credit scoring data from the lectures into R.
The data can be found here.
Suppose you saved the data to a file "credit.txt" in the directory "dm"
on the C drive. To read it into R, type (">" denotes the R prompt):


> credit.dat <- read.csv("C:/dm/credit.txt")

You have now assigned this data set to a variable called "credit.dat"
("<-" is the assignment operator in R).

To display its value, just type its name at the command line:

> credit.dat
   age married house income gender class
1   22       0     0     28      1     0
2   46       0     1     32      0     0
3   24       1     1     24      1     0
4   25       0     0     27      1     0
5   29       1     1     32      0     0
6   45       1     1     30      0     1
7   63       1     1     58      1     1
8   36       1     0     52      1     1
9   23       0     1     40      0     1
10  50       1     1     28      0     1

(The first column contains the row numbers, and the first row contains the column names.)
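If the file is not at hand, the same ten rows can be typed in directly. The sketch below assumes the file is comma-separated with a header line (the defaults of read.csv); the variable name "credit.txt" for the text is only illustrative.

```r
# Assumption: comma-separated values with a header line, as read.csv() expects.
# The rows are copied from the listing above.
credit.txt <- "age,married,house,income,gender,class
22,0,0,28,1,0
46,0,1,32,0,0
24,1,1,24,1,0
25,0,0,27,1,0
29,1,1,32,0,0
45,1,1,30,0,1
63,1,1,58,1,1
36,1,0,52,1,1
23,0,1,40,0,1
50,1,1,28,0,1"

# read.csv() can parse a character string through its "text" argument
credit.dat <- read.csv(text = credit.txt)

dim(credit.dat)   # number of rows and number of columns
str(credit.dat)   # compact overview of the columns and their types
```

After reading any data file it is worth checking dim() and str() to confirm that the header and the separator were interpreted as intended.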

"credit.dat" is now an object of type "data.frame". This is similar to (but subtly
different from) a matrix. In any case, you can index a data frame like a matrix.

Select the first row of credit.dat:

> credit.dat[1,]
  age married house income gender class
1  22       0     0     28      1     0

Select the fourth column of credit.dat:

> credit.dat[,4]
 [1] 28 32 24 27 32 30 58 52 40 28

Select the element in row 5, column 1:

> credit.dat[5,1]
[1] 29
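Columns of a data frame can also be selected by name instead of by position, which tends to be easier to read. A small sketch, with the data typed in from the listing above:

```r
credit.dat <- data.frame(
  age     = c(22, 46, 24, 25, 29, 45, 63, 36, 23, 50),
  married = c(0, 0, 1, 0, 1, 1, 1, 1, 0, 1),
  house   = c(0, 1, 1, 0, 1, 1, 1, 0, 1, 1),
  income  = c(28, 32, 24, 27, 32, 30, 58, 52, 40, 28),
  gender  = c(1, 0, 1, 1, 0, 0, 1, 1, 0, 0),
  class   = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

credit.dat$income        # same values as credit.dat[,4]
credit.dat[, "income"]   # equivalent, using the column name as index
credit.dat[5, "age"]     # same element as credit.dat[5,1]
```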

Give the distinct values of income, sorted from low to high:

> sort(unique(credit.dat[,4]))
[1] 24 27 28 30 32 40 52 58


Add all the entries of the sixth column:

> sum(credit.dat[,6])
[1] 5

Add the entries of each column of credit.dat:

> apply(credit.dat,2,sum)
    age married   house  income  gender   class 
    363       6       7     351       5       5 

Add the entries of each row:

> apply(credit.dat,1,sum)
  1   2   3   4   5   6   7   8   9  10 
 51  79  51  53  63  78 125  91  65  81
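For plain sums, R also provides the dedicated functions colSums and rowSums, which give the same totals as the apply calls above. A sketch, with the data typed in from the listing:

```r
credit.dat <- data.frame(
  age     = c(22, 46, 24, 25, 29, 45, 63, 36, 23, 50),
  married = c(0, 0, 1, 0, 1, 1, 1, 1, 0, 1),
  house   = c(0, 1, 1, 0, 1, 1, 1, 0, 1, 1),
  income  = c(28, 32, 24, 27, 32, 30, 58, 52, 40, 28),
  gender  = c(1, 0, 1, 1, 0, 0, 1, 1, 0, 0),
  class   = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

colSums(credit.dat)   # same totals as apply(credit.dat,2,sum)
rowSums(credit.dat)   # same totals as apply(credit.dat,1,sum)
```

apply remains the more general tool, since it accepts any function (for example mean or max) in place of sum.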



Select all rows where the first column is bigger than 27:

> credit.dat[credit.dat[,1] > 27,]
   age married house income gender class
2   46       0     1     32      0     0
5   29       1     1     32      0     0
6   45       1     1     30      0     1
7   63       1     1     58      1     1
8   36       1     0     52      1     1
10  50       1     1     28      0     1
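Several conditions can be combined with "&" (and) and "|" (or) inside the row index. The sketch below selects the rows where age is bigger than 27 and the class label is 1; the name "selected" is only illustrative.

```r
credit.dat <- data.frame(
  age     = c(22, 46, 24, 25, 29, 45, 63, 36, 23, 50),
  married = c(0, 0, 1, 0, 1, 1, 1, 1, 0, 1),
  house   = c(0, 1, 1, 0, 1, 1, 1, 0, 1, 1),
  income  = c(28, 32, 24, 27, 32, 30, 58, 52, 40, 28),
  gender  = c(1, 0, 1, 1, 0, 0, 1, 1, 0, 0),
  class   = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

# Both conditions are evaluated element by element and combined with "&"
selected <- credit.dat[credit.dat$age > 27 & credit.dat$class == 1, ]
selected   # rows 6, 7, 8 and 10
```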

Construct a vector "x" with the numbers 2,5,10 in that order:

> x <- c(2,5,10)
> x
[1]  2  5 10

Construct a vector consisting of the numbers 1 through 10 (the colon operator
already returns a vector, so wrapping it in c() is allowed but not needed):

> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10


Select the *row numbers* of the rows where the first column of credit.dat is bigger than 27:

> c(1:10)[credit.dat[,1] > 27]
[1]  2  5  6  7  8 10
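The built-in function which returns the positions at which a logical vector is TRUE, so it gives the same row numbers without hard-coding the length 10. In this sketch the vector "age" stands in for credit.dat[,1].

```r
# age holds the values of the first column of credit.dat
age <- c(22, 46, 24, 25, 29, 45, 63, 36, 23, 50)

which(age > 27)   # 2 5 6 7 8 10
```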

Draw a random sample of size 5 from the numbers 1 through 10 (without replacement):

> index <- sample(10,5)

> index
[1] 5 1 7 4 6

Select the corresponding rows:

> train <- credit.dat[index,]
> train
  age married house income gender class
5  29       1     1     32      0     0
1  22       0     0     28      1     0
7  63       1     1     58      1     1
4  25       0     0     27      1     0
6  45       1     1     30      0     1

Select all rows with row number not in "index":

> test <- credit.dat[-index,]
> test
   age married house income gender class
2   46       0     1     32      0     0
3   24       1     1     24      1     0
8   36       1     0     52      1     1
9   23       0     1     40      0     1
10  50       1     1     28      0     1
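Because sample draws at random, each run normally gives a different index. Calling set.seed first makes the draw repeatable, which helps when debugging. A sketch follows; the seed value 1 is arbitrary, and only two of the six columns are typed in to keep it short.

```r
# Only two columns of the credit data are entered here for brevity
credit.dat <- data.frame(
  age   = c(22, 46, 24, 25, 29, 45, 63, 36, 23, 50),
  class = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

set.seed(1)                             # any fixed integer makes the draw repeatable
index <- sample(nrow(credit.dat), 5)    # 5 distinct row numbers out of 10
train <- credit.dat[index, ]
test  <- credit.dat[-index, ]

# The two parts do not overlap and together cover all rows
length(intersect(rownames(train), rownames(test)))   # 0
nrow(train) + nrow(test)                              # 10
```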

Consult the help page of the function "sample":

> help(sample)


At the end of a session (and also during a session), save your workspace to a file (choose
"Save Workspace" from the File menu). Otherwise all results (the functions you created, etc.)
will be lost after you quit R.
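The workspace can also be saved from the command line with save.image and restored with load. A sketch follows; the file name "dm_session.RData" and the use of a temporary directory are only illustrative, so choose your own location in practice.

```r
x <- c(2, 5, 10)   # some object created during the session

# Hypothetical file location; replace it with a path of your own
ws <- file.path(tempdir(), "dm_session.RData")

save.image(file = ws)   # write the objects in the workspace to the file
file.exists(ws)         # TRUE once the file has been written

load(ws)                # in a later session, this restores the saved objects
```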

Practice exercise 1
Assume we have a classification problem with only 2 classes, labeled 0 and 1 respectively.
Write a function that computes the impurity of a vector (of arbitrary length) of class labels.
Use the Gini index as the impurity measure. Do not use a loop in your function;
it is not necessary.

Example:

> y <- c(1,0,1,1,1,0,0,1,1,0,1)
> y
 [1] 1 0 1 1 1 0 0 1 1 0 1

> impurity(y)
[1] 0.2314050
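The expected output can be checked by hand. The vector y contains 4 zeros and 7 ones, and the product of the two class proportions reproduces the value shown. This suggests that the Gini index is taken here as p(0)*p(1); that reading is inferred from the example output and is not stated explicitly in the text.

```r
y <- c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1)

table(y)      # counts per label: 4 zeros and 7 ones

p0 <- 4 / 11  # proportion of class 0
p1 <- 7 / 11  # proportion of class 1
p0 * p1       # 28/121, approximately 0.231405
```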


If you are not working in RStudio, you can create the function with:

> fix(impurity, editor="Notepad")


This will open a Notepad window. Type in the function definition, save the file and exit the editor.


Practice exercise 2
Write a function "bestsplit(x,y)" that computes the best split value on a numeric attribute x.
Here x is a vector of numeric values, and y is the vector of class labels (assume there
are only two classes, coded as 0 and 1). x and y must be of the same length: y[i] is the class
label of the i-th observation, and x[i] is the corresponding value of attribute x.
Only consider splits of type "x <= c" where "c" is the average of two consecutive values of x in the sorted order.
So one child contains all elements with "x <= c" and the other child contains all elements with "x > c".
The best split is the split that achieves the highest impurity reduction.

Example (best split on income):

> bestsplit(credit.dat[,4],credit.dat[,6])
[1] 36
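To make "impurity reduction" concrete, the sketch below evaluates the single candidate split "income <= 36" by hand. It assumes the usual definition (impurity of the parent minus the size-weighted impurities of the two children) and the two-class Gini index p(0)*p(1); check both against the lecture notes. Searching over all candidate split points is left to the exercise.

```r
income <- c(28, 32, 24, 27, 32, 30, 58, 52, 40, 28)   # credit.dat[,4]
class  <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)             # credit.dat[,6]

left  <- class[income <= 36]   # labels in the child with income <= 36
right <- class[income > 36]    # labels in the child with income > 36

# For 0/1 labels, the mean is the proportion of class 1
p  <- mean(class)
pl <- mean(left)
pr <- mean(right)

# Assumed definition: parent impurity minus weighted child impurities
reduction <- p * (1 - p) -
  length(left)  / length(class) * pl * (1 - pl) -
  length(right) / length(class) * pr * (1 - pr)
reduction   # approximately 0.1071
```

Here the parent has impurity 0.25, the left child (7 rows, 2 of class 1) has impurity 10/49, and the right child (3 rows, all class 1) is pure.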

Hint: Clever use of "subscripting" (selecting elements of vectors and matrices) is important in R. For example,
      y[x > 29] produces a vector with all elements of y whose corresponding x-element (that is, the element of x with
      the same index) is bigger than 29. More formally: y[x > 29] = {y[i]: x[i] > 29}. The result is a vector, not a set, so
      duplicate values may occur and the original order is kept. Just try it!
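A quick way to try this is with the income and class columns of the credit data, typed in here as plain vectors:

```r
x <- c(28, 32, 24, 27, 32, 30, 58, 52, 40, 28)   # income, credit.dat[,4]
y <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)             # class,  credit.dat[,6]

x > 29        # logical vector: TRUE where income exceeds 29
y[x > 29]     # 0 0 1 1 1 1  (labels of rows 2, 5, 6, 7, 8, 9)
y[x <= 29]    # 0 0 0 1      (labels of rows 1, 3, 4, 10)
```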

Hint: Example of how to determine candidate split points:

  > income.sorted <- sort(unique(credit.dat[,4]))
  > income.sorted
   [1] 24 27 28 30 32 40 52 58
  > income.splitpoints <- (income.sorted[1:7]+income.sorted[2:8])/2
  > income.splitpoints
   [1] 25.5 27.5 29.0 31.0 36.0 46.0 55.0
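The indices 1:7 and 2:8 above are specific to a vector with 8 distinct values. One possible way to avoid hard-coding the length is to use head and tail with a negative count, as in this sketch:

```r
income <- c(28, 32, 24, 27, 32, 30, 58, 52, 40, 28)   # credit.dat[,4]
income.sorted <- sort(unique(income))

# head(v, -1) drops the last element, tail(v, -1) drops the first,
# so the sum pairs each value with its successor in the sorted order
income.splitpoints <- (head(income.sorted, -1) + tail(income.sorted, -1)) / 2
income.splitpoints   # 25.5 27.5 29.0 31.0 36.0 46.0 55.0
```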

Note: use the "brute force" approach, i.e. don't implement the "segment borders" algorithm.