```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(warn = -1)
```
## Step 1: Reading and partitioning the data
We read the dataset, kept only the complete cases, converted `Survived` to a factor, and split the data into an 80/20 training/test split as instructed.
```{r}
titanicData <- read.csv("titanic.csv")
titanicData <- titanicData[complete.cases(titanicData),]
titanicData$Survived <- as.factor(titanicData$Survived)
set.seed(1)
train <- sample(1:nrow(titanicData), size = nrow(titanicData)*0.8)
test <- dplyr::setdiff(1:nrow(titanicData), train)
titanicDataTrain <- titanicData[train, ]
titanicDataTest <- titanicData[test, ]
```
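The random split above does not guarantee that both partitions have the same proportion of survivors. A stratified split is a common alternative; as a sketch (illustrated on a synthetic outcome vector, since it only depends on the class labels), `caret::createDataPartition()` samples within each class to preserve the class proportions:

```r
# Sketch: a stratified alternative to the plain random split above.
# createDataPartition() samples within each level of the outcome, so the
# training slice keeps roughly the same 0/1 proportions as the full data.
library(caret)
set.seed(1)
y <- factor(sample(c(0, 1), 100, replace = TRUE, prob = c(0.6, 0.4)))
idx <- createDataPartition(y, p = 0.8, list = FALSE)
prop.table(table(y[idx]))  # close to the proportions of the full vector y
```

On the Titanic data this would be `createDataPartition(titanicData$Survived, p = 0.8, list = FALSE)` in place of `sample()`.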
## Training and Testing the k-NN models
We use `train()` from caret to fit a k-NN model, passing a `tuneGrid` that covers every k from 2 to 30. The best k is chosen by repeated 10-fold cross-validation (the `repeatedcv` method below runs 10-fold CV with 10 repeats).
```{r}
library(caret)
set.seed(400)
ctrl <- trainControl(method = "repeatedcv", repeats = 10)
knnFit <- train(Survived ~ Fare + Age,
                data = titanicDataTrain,
                method = "knn",
                trControl = ctrl,
                preProcess = c("center", "scale"),
                tuneGrid = expand.grid(k = 2:30))
plot(knnFit)
knnFit$bestTune
```
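Under the hood, caret's k-NN centers and scales the predictors (so that `Fare` and `Age` contribute on comparable scales) and then runs a nearest-neighbour vote for each candidate k. A minimal sketch of that single-k step, using `class::knn()` on the built-in `iris` data restricted to two classes (since `titanic.csv` is not bundled here):

```r
# Sketch of one cell of the tuning grid: scale the predictors, then run
# class::knn() for a fixed k. caret repeats this over the grid and the
# cross-validation folds to pick the best k.
library(class)
set.seed(400)
d <- iris[iris$Species != "setosa", ]
d$Species <- droplevels(d$Species)
tr <- sample(nrow(d), 0.8 * nrow(d))
X <- scale(d[, c("Sepal.Length", "Petal.Length")])  # center and scale
pred <- knn(train = X[tr, ], test = X[-tr, ], cl = d$Species[tr], k = 29)
mean(pred == d$Species[-tr])  # held-out accuracy for k = 29
```

The column names, k value, and split here are illustrative; only the scale-then-vote mechanics mirror what `train()` does.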
The best cross-validated fit was at k = 29 and k = 30. Since this is binary classification, we prefer the odd value k = 29, which cannot produce tied votes.
Next, we use the 29-NN model to measure accuracy on the test set.
## Accuracy on test dataset
```{r}
pred <- predict(knnFit, newdata = titanicDataTest)
confusionMatrix(pred, titanicDataTest$Survived )
```
The test accuracy is 0.6538, but every prediction was class 0. Because the data are skewed towards class 0, the model defaults to the majority class. Accuracy is therefore a poor metric for model selection on this dataset, as it rewards always predicting the class with more observations.
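The imbalance problem can be made concrete with a small sketch (synthetic labels, hypothetical 65/35 split): a degenerate "classifier" that always predicts the majority class still scores well on accuracy, while balanced accuracy (the mean of sensitivity and specificity, reported by `confusionMatrix`) exposes it:

```r
# Sketch: why accuracy misleads on imbalanced data. A model that always
# predicts the majority class 0 gets 65% accuracy on a 65/35 split, yet
# its balanced accuracy is only 0.5, the same as random guessing.
library(caret)
truth <- factor(c(rep(0, 65), rep(1, 35)), levels = c(0, 1))
pred  <- factor(rep(0, 100), levels = c(0, 1))
cm <- confusionMatrix(pred, truth)
cm$overall["Accuracy"]           # 0.65, driven purely by the imbalance
cm$byClass["Balanced Accuracy"]  # 0.5, revealing the degenerate model
```

For this dataset, metrics such as balanced accuracy or Kappa (both in the `confusionMatrix` output above) would be more informative than raw accuracy.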