```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(warn = -1)
```
## Step 1: Reading and partitioning the data
We read the dataset, kept only the complete cases, converted `Survived` to a factor, and split the data into an 80/20 training/test split as instructed.
```{r}
titanicData <- read.csv("titanic.csv")
titanicData <- titanicData[complete.cases(titanicData),]
titanicData$Survived <- as.factor(titanicData$Survived)
set.seed(1)
train <- sample(1:nrow(titanicData), size = nrow(titanicData)*0.8)
test <- dplyr::setdiff(1:nrow(titanicData), train)
titanicDataTrain <- titanicData[train, ]
titanicDataTest <- titanicData[test, ]
```
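The random split above does not guarantee that both partitions have the same proportion of survivors. A stratified split is a common alternative; as a sketch (illustrated on a synthetic outcome vector, since it only depends on the class labels), `caret::createDataPartition()` samples within each class to preserve the class proportions:

```r
# Sketch: a stratified alternative to the plain random split above.
# createDataPartition() samples within each level of the outcome, so the
# training slice keeps roughly the same 0/1 proportions as the full data.
library(caret)
set.seed(1)
y <- factor(sample(c(0, 1), 100, replace = TRUE, prob = c(0.6, 0.4)))
idx <- createDataPartition(y, p = 0.8, list = FALSE)
prop.table(table(y[idx]))  # close to the proportions of the full vector y
```

On the Titanic data this would be `createDataPartition(titanicData$Survived, p = 0.8, list = FALSE)` in place of `sample()`.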
## Training and Testing the k-NN models
We use `train()` from caret to fit a k-NN model, passing a `tuneGrid` that covers every k from 2 to 30. The best k is chosen by repeated 10-fold cross-validation (the `repeatedcv` method below runs 10-fold CV with 10 repeats).
```{r}
library(caret)
set.seed(400)
ctrl <- trainControl(method = "repeatedcv", repeats = 10)
knnFit <- train(Survived ~ Fare + Age,
                data = titanicDataTrain,
                method = "knn",
                trControl = ctrl,
                preProcess = c("center", "scale"),
                tuneGrid = expand.grid(k = 2:30))
plot(knnFit)
knnFit$bestTune
```
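Under the hood, caret's k-NN centers and scales the predictors (so that `Fare` and `Age` contribute on comparable scales) and then runs a nearest-neighbour vote for each candidate k. A minimal sketch of that single-k step, using `class::knn()` on the built-in `iris` data restricted to two classes (since `titanic.csv` is not bundled here):

```r
# Sketch of one cell of the tuning grid: scale the predictors, then run
# class::knn() for a fixed k. caret repeats this over the grid and the
# cross-validation folds to pick the best k.
library(class)
set.seed(400)
d <- iris[iris$Species != "setosa", ]
d$Species <- droplevels(d$Species)
tr <- sample(nrow(d), 0.8 * nrow(d))
X <- scale(d[, c("Sepal.Length", "Petal.Length")])  # center and scale
pred <- knn(train = X[tr, ], test = X[-tr, ], cl = d$Species[tr], k = 29)
mean(pred == d$Species[-tr])  # held-out accuracy for k = 29
```

The column names, k value, and split here are illustrative; only the scale-then-vote mechanics mirror what `train()` does.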
The best cross-validated fit was at k = 29 and k = 30. Since this is binary classification, we prefer the odd value k = 29, which cannot produce tied votes.
Next, we use the 29-NN model to measure accuracy on the test set.
## Accuracy on test dataset
```{r}
pred <- predict(knnFit, newdata = titanicDataTest)
confusionMatrix(pred, titanicDataTest$Survived )
```
The test accuracy is 0.6538, but every prediction was class 0. Because the data are skewed towards class 0, the model defaults to the majority class. Accuracy is therefore a poor metric for model selection on this dataset, as it rewards always predicting the class with more observations.
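The imbalance problem can be made concrete with a small sketch (synthetic labels, hypothetical 65/35 split): a degenerate "classifier" that always predicts the majority class still scores well on accuracy, while balanced accuracy (the mean of sensitivity and specificity, reported by `confusionMatrix`) exposes it:

```r
# Sketch: why accuracy misleads on imbalanced data. A model that always
# predicts the majority class 0 gets 65% accuracy on a 65/35 split, yet
# its balanced accuracy is only 0.5, the same as random guessing.
library(caret)
truth <- factor(c(rep(0, 65), rep(1, 35)), levels = c(0, 1))
pred  <- factor(rep(0, 100), levels = c(0, 1))
cm <- confusionMatrix(pred, truth)
cm$overall["Accuracy"]           # 0.65, driven purely by the imbalance
cm$byClass["Balanced Accuracy"]  # 0.5, revealing the degenerate model
```

For this dataset, metrics such as balanced accuracy or Kappa (both in the `confusionMatrix` output above) would be more informative than raw accuracy.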