GETTING STARTED WITH THE ASSIGNMENT IN R
We start by reading the credit scoring data from the lectures into R.
The data can be found here.
Suppose you saved the data to a file "credit.txt" in the directory "dm"
on the C drive. To read it into R type (">" denotes the R prompt):
> credit.dat <- read.csv("C:/dm/credit.txt")
You have now assigned this data set to a variable called "credit.dat"
(" <- " is the assignment symbol in R).
To display its value, just type its name at the command line:
> credit.dat
   age married house income gender class
1   22       0     0     28      1     0
2   46       0     1     32      0     0
3   24       1     1     24      1     0
4   25       0     0     27      1     0
5   29       1     1     32      0     0
6   45       1     1     30      0     1
7   63       1     1     58      1     1
8   36       1     0     52      1     1
9   23       0     1     40      0     1
10  50       1     1     28      0     1
(the first column contains the row numbers, and the first row contains the column names)
"credit.dat" is now an object of type "data.frame". This is similar to a matrix, but its
columns may be of different types (e.g. numeric and character), whereas all entries of a
matrix have the same type. In any case, you can index a data frame like a matrix.
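Before working with the data it can help to check that it was read as intended.
The sketch below rebuilds the same ten rows inline so that it runs without the file.
That the file is comma-separated with a header line is an assumption, based only on
the use of read.csv above.

```r
# Inline copy of the ten rows shown above. The comma-separated layout
# with a header line is an assumption about credit.txt.
credit.dat <- read.csv(text = "age,married,house,income,gender,class
22,0,0,28,1,0
46,0,1,32,0,0
24,1,1,24,1,0
25,0,0,27,1,0
29,1,1,32,0,0
45,1,1,30,0,1
63,1,1,58,1,1
36,1,0,52,1,1
23,0,1,40,0,1
50,1,1,28,0,1")

class(credit.dat)   # type of the object
dim(credit.dat)     # number of rows and number of columns
names(credit.dat)   # column names
str(credit.dat)     # type and first values of each column
```

If dim or names give something unexpected (for example a single column), the
separator or header setting used when reading the file is the first thing to check.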
Select the first row of credit.dat:
> credit.dat[1,]
  age married house income gender class
1  22       0     0     28      1     0
Select the fourth column of credit.dat:
> credit.dat[,4]
[1] 28 32 24 27 32 30 58 52 40 28
Select the element in row 5, column 1:
> credit.dat[5,1]
[1] 29
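Columns of a data frame can also be selected by name, which is often easier to read
than a column number. A minimal sketch, with the data frame rebuilt inline from the
rows shown above so that it runs on its own:

```r
# The ten rows of credit.dat, entered column by column.
credit.dat <- data.frame(
  age     = c(22, 46, 24, 25, 29, 45, 63, 36, 23, 50),
  married = c(0, 0, 1, 0, 1, 1, 1, 1, 0, 1),
  house   = c(0, 1, 1, 0, 1, 1, 1, 0, 1, 1),
  income  = c(28, 32, 24, 27, 32, 30, 58, 52, 40, 28),
  gender  = c(1, 0, 1, 1, 0, 0, 1, 1, 0, 0),
  class   = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

credit.dat$income        # same values as credit.dat[,4]
credit.dat[, "income"]   # same again, using matrix-style indexing
credit.dat[5, "age"]     # same element as credit.dat[5,1]
```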
Give the distinct values of income, sorted from low to high:
> sort(unique(credit.dat[,4]))
[1] 24 27 28 30 32 40 52 58
Add all the entries of the sixth column:
> sum(credit.dat[,6])
[1] 5
Add the entries of each column of credit.dat:
> apply(credit.dat,2,sum)
    age married   house  income  gender   class
    363       6       7     351       5       5
Add the entries of each row:
> apply(credit.dat,1,sum)
  1   2   3   4   5   6   7   8   9  10
 51  79  51  53  63  78 125  91  65  81
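For sums specifically, base R also provides colSums and rowSums, which give the same
totals as the two apply calls above. A small sketch, with the data frame rebuilt
inline so that it runs on its own:

```r
# The ten rows of credit.dat, entered column by column.
credit.dat <- data.frame(
  age     = c(22, 46, 24, 25, 29, 45, 63, 36, 23, 50),
  married = c(0, 0, 1, 0, 1, 1, 1, 1, 0, 1),
  house   = c(0, 1, 1, 0, 1, 1, 1, 0, 1, 1),
  income  = c(28, 32, 24, 27, 32, 30, 58, 52, 40, 28),
  gender  = c(1, 0, 1, 1, 0, 0, 1, 1, 0, 0),
  class   = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

colSums(credit.dat)   # same totals as apply(credit.dat,2,sum)
rowSums(credit.dat)   # same totals as apply(credit.dat,1,sum)
```

apply remains the more general tool, since it accepts any function (for example
max or mean) in place of sum.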
Select all rows where the first column is bigger than 27:
> credit.dat[credit.dat[,1] > 27,]
   age married house income gender class
2   46       0     1     32      0     0
5   29       1     1     32      0     0
6   45       1     1     30      0     1
7   63       1     1     58      1     1
8   36       1     0     52      1     1
10  50       1     1     28      0     1
Construct a vector "x" with the numbers 2,5,10 in that order:
> x <- c(2,5,10)
> x
[1] 2 5 10
Construct a vector consisting of the numbers 1 through 10:
> c(1:10)
[1] 1 2 3 4 5 6 7 8 9 10
Select the *row numbers* of the rows where the first column of credit.dat is bigger than 27:
> c(1:10)[credit.dat[,1] > 27]
[1] 2 5 6 7 8 10
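The same row numbers can be obtained without hard-coding the length 10. The sketch
below uses the base R functions which and seq_along; the vector "age" is just a copy
of the first column of credit.dat, introduced here so that the example runs on its own.

```r
# Copy of the first column of credit.dat.
age <- c(22, 46, 24, 25, 29, 45, 63, 36, 23, 50)

which(age > 27)            # positions where the condition is TRUE
seq_along(age)[age > 27]   # same idea as c(1:10)[...], for any length
```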
Draw a random sample of size 5 from the numbers 1 through 10 (without replacement):
> index <- sample(10,5)
> index
[1] 5 1 7 4 6
Select the corresponding rows:
> train <- credit.dat[index,]
> train
  age married house income gender class
5  29       1     1     32      0     0
1  22       0     0     28      1     0
7  63       1     1     58      1     1
4  25       0     0     27      1     0
6  45       1     1     30      0     1
Select all rows with row number not in "index":
> test <- credit.dat[-index,]
> test
   age married house income gender class
2   46       0     1     32      0     0
3   24       1     1     24      1     0
8   36       1     0     52      1     1
9   23       0     1     40      0     1
10  50       1     1     28      0     1
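Because sample draws at random, the rows selected will differ from run to run. If a
repeatable split is wanted, the random number generator can be seeded first. A minimal
sketch; the seed value 1 and the two-column data frame "dat" are arbitrary choices
made for this illustration:

```r
# Two columns of credit.dat, enough to illustrate the split.
dat <- data.frame(
  income = c(28, 32, 24, 27, 32, 30, 58, 52, 40, 28),
  class  = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

set.seed(1)                            # arbitrary seed; fixes the random draw
index <- sample(nrow(dat), 5)          # 5 distinct row numbers from 1..nrow(dat)
train <- dat[index, ]                  # the sampled rows
test  <- dat[-index, ]                 # all remaining rows
```

Using nrow(dat) instead of a literal 10 keeps the code correct when the data set
changes size.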
Consult the help page of the function "sample":
> help(sample)
At the end of a session (and also during a session), save your workspace to a file (choose
"Save Workspace" from the file menu). Otherwise all results (the functions you created, etc.)
will be lost after you quit R.
Practice exercise 1
Assume we have a classification problem with only 2 classes that are labeled 0 and 1 respectively.
Write a function that computes the impurity of a vector (of arbitrary length) of class labels.
Use the Gini index as impurity measure. Do not use a loop structure in your function;
it is not necessary.
Example:
> y <- c(1,0,1,1,1,0,0,1,1,0,1)
> y
[1] 1 0 1 1 1 0 0 1 1 0 1
> impurity(y)
[1] 0.2314050
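The value in this example can be checked by hand. The check below assumes that the
two-class Gini index is defined as p(1-p), where p is the proportion of labels equal
to 1; this agrees with the output shown, but verify it against the definition given
in the lectures (some texts use 2p(1-p) instead).

```r
# y contains 7 ones among 11 labels.
p <- 7 / 11
p * (1 - p)   # 28/121, approximately 0.2314
```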
If you are not working in RStudio, to create the function, use:
> fix(impurity, editor="Notepad")
This will open a Notepad window. Type in the function definition, save the file and exit the editor.
Practice exercise 2
Write a function "bestsplit(x,y)" that computes the best split value on a numeric attribute x.
Here x is a vector of numeric values, and y is the vector of class labels (assume there
are only two classes, coded as 0 and 1). x and y must be of the same length: y[i] is the class
label of the i-th observation, and x[i] is the corresponding value of attribute x.
Only consider splits of type "x <= c" where "c" is the average of two consecutive values of x in the sorted order.
So one child contains all elements with "x <= c" and the other child contains all elements with "x > c".
The best split is the split that achieves the highest impurity reduction.
Example (best split on income):
> bestsplit(credit.dat[,4],credit.dat[,6])
[1] 36
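The impurity reduction of this split can be checked by hand. The arithmetic below
assumes the Gini index p(1-p) and the usual definition of impurity reduction as the
impurity of the parent minus the weighted average impurity of the two children;
check both against the lecture notes.

```r
# Split "income <= 36" on the credit data.
i.parent <- (5 / 10) * (5 / 10)   # 5 of the 10 rows have class 1
i.left   <- (2 / 7) * (5 / 7)     # 7 rows with income <= 36, 2 of them class 1
i.right  <- 0                     # 3 rows with income > 36, all of class 1

reduction <- i.parent - (7 / 10 * i.left + 3 / 10 * i.right)
reduction                         # 0.25 - 1/7, approximately 0.107
```

Repeating this calculation for the other candidate split points gives smaller
reductions, which is why 36 is returned.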
Hint: Clever use of "subscripting" (selecting elements of vectors and matrices) is important in R. For example,
y[x > 29] produces a vector with all elements of y whose corresponding x-element (that is the element of x with
the same index) is bigger than 29. More formally: y[x > 29] = {y[i]: x[i] > 29}. The result is a vector, not a set, i.e.
duplicate values may occur. Just try it!
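A small illustration of this kind of subscripting, using the income and class
columns of the credit data entered as plain vectors so that it runs on its own:

```r
x <- c(28, 32, 24, 27, 32, 30, 58, 52, 40, 28)   # income
y <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)             # class

x > 29        # logical vector with one entry per observation
y[x > 29]     # class labels of the observations with income > 29
y[x <= 29]    # class labels of the remaining observations
```

The two results together contain every element of y exactly once, which is what is
needed to form the two children of a split.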
Hint: Example of how to determine candidate split points
> income.sorted <- sort(unique(credit.dat[,4]))
> income.sorted
[1] 24 27 28 30 32 40 52 58
> income.splitpoints <- (income.sorted[1:7]+income.sorted[2:8])/2
> income.splitpoints
[1] 25.5 27.5 29.0 31.0 36.0 46.0 55.0
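The indices 1:7 and 2:8 above are specific to the income column, which has 8 distinct
values. One possible way to write the same computation for a vector of any length is
sketched below, using negative indices to drop the last and the first element; the
names "s" and "n" are introduced here for illustration only.

```r
income <- c(28, 32, 24, 27, 32, 30, 58, 52, 40, 28)

s <- sort(unique(income))            # distinct values, low to high
n <- length(s)
splitpoints <- (s[-n] + s[-1]) / 2   # midpoints of consecutive values
splitpoints
```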
Note: use the "brute force" approach, i.e. don't implement the "segment borders" algorithm.