```{r setup, include=FALSE}
# Echo code in the rendered output; hide package startup messages and warnings
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
## `Question 1`: Run a classification tree, using the default controls of rpart(). Looking at the validation set, what is the overall accuracy? What is the lift on the first decile?
```{r q1}
library(readr)
library(rpart)
library(rpart.plot)
library(caret)
library(lift)

# Read the data and convert the categorical variables to factors
eBayAuctions <- read.csv("eBayAuctions.csv")
eBayAuctions$Category <- as.factor(eBayAuctions$Category)
eBayAuctions$currency <- as.factor(eBayAuctions$currency)
eBayAuctions$endDay   <- as.factor(eBayAuctions$endDay)

# Rename the outcome column (column 8) and make it a factor
colnames(eBayAuctions)[8] <- "Competitive"
eBayAuctions$Competitive <- as.factor(eBayAuctions$Competitive)

# 60/40 training/validation split
n <- nrow(eBayAuctions)
set.seed(12345)
train_samp <- sample(1:n, floor(0.6 * n), replace = FALSE)
training   <- eBayAuctions[train_samp, ]
validation <- eBayAuctions[-train_samp, ]

# Classification tree with rpart's default controls
model.tree <- rpart(Competitive ~ ., data = training, method = "class")
rpart.plot(model.tree)

# Validation-set performance
pred <- predict(model.tree, validation, type = "class")
confusionMatrix(pred, validation$Competitive)

# Lift in the first (top) decile; TopDecileLift() expects numeric 0/1 inputs
val_y <- as.numeric(as.character(validation$Competitive))
TopDecileLift(as.numeric(as.character(pred)), val_y)
```
The default classification tree reaches an overall accuracy of about 85% on the validation set, and the lift in the first decile is 1.668.
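
Top-decile lift is often computed from predicted class probabilities rather than hard class labels, so that validation cases can be ranked before the top 10% is taken. Below is a minimal sketch of that alternative, reusing `model.tree`, `validation`, and `val_y` from the chunk above; the manual calculation is included only to make the definition explicit.

```{r q1-lift-prob, eval=FALSE}
# Predicted probability of the competitive class ("1") for each validation auction
prob1 <- predict(model.tree, validation, type = "prob")[, "1"]

# First-decile lift: response rate among the top-ranked 10% of cases
# divided by the overall response rate
top_n   <- ceiling(0.1 * length(val_y))
top_idx <- order(prob1, decreasing = TRUE)[1:top_n]
mean(val_y[top_idx]) / mean(val_y)

# Equivalent call using the lift package
TopDecileLift(prob1, val_y)
```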
## `Question 2`: Run a boosted tree with the same predictors (use function boosting() in the adabag package). For the validation set, what is the overall accuracy? What is the lift on the first decile?
```{r q2}
library(adabag)

# AdaBoost ensemble of 10 trees; boos = TRUE draws a bootstrap sample
# using the observation weights at each iteration
boost_model <- boosting(Competitive ~ ., data = training, mfinal = 10, boos = TRUE)

# Validation-set performance
pred2 <- predict(boost_model, newdata = validation)
confusionMatrix(as.factor(pred2$class), validation$Competitive)
TopDecileLift(as.numeric(pred2$class), val_y)
```
The boosting model achieves an overall accuracy of 0.8796 on the validation set, and the lift in the first decile is 1.761.
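
The fitted boosting object also carries diagnostics that help interpret the ensemble. A short sketch of two of them, assuming the `boost_model` fitted above (component names as documented for adabag's `boosting()`):

```{r q2-importance, eval=FALSE}
# Relative importance of each predictor, aggregated over the 10 boosted trees
sort(boost_model$importance, decreasing = TRUE)

# Weight given to each tree's vote (more accurate trees receive larger weights)
boost_model$weights
```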
## `Question 3`: Run a bagged tree with the same predictors (use function bagging() in the adabag package). For the validation set, what is the overall accuracy? What is the lift on the first decile?
```{r q3}
# Bagged ensemble of 10 trees: bootstrap samples, all predictors considered at every split
model_bag <- bagging(Competitive ~ ., data = training, mfinal = 10)

# Validation-set performance
pred_bag <- predict(model_bag, validation)
confusionMatrix(as.factor(pred_bag$class), validation$Competitive)
TopDecileLift(as.numeric(pred_bag$class), val_y)
```
The bagging model achieves an overall accuracy of 0.8631 on the validation set, and the lift in the first decile is 1.807.
## `Question 4`: Run a random forest (use function randomForest() in package randomForest with argument mtry = 4). Compare the bagged tree to the random forest in terms of validation accuracy and lift on first decile. How are the two methods conceptually different?
```{r q4}
library(randomForest)

# Random forest: 4 predictors are sampled as split candidates at each node (mtry = 4)
model_rf <- randomForest(Competitive ~ ., data = training, mtry = 4)

# Validation-set performance
pred_rf <- predict(model_rf, validation)
confusionMatrix(pred_rf, validation$Competitive)
TopDecileLift(as.numeric(as.character(pred_rf)), val_y)
```
The random forest achieves an overall accuracy of 0.8847 on the validation set, and the lift in the first decile is 1.738.
The fundamental difference lies in how each split is chosen: bagging grows every tree on a bootstrap sample but considers all predictors at every split, whereas a random forest additionally draws a random subset of predictors (here mtry = 4) at each node and splits on the best predictor within that subset. This extra randomization decorrelates the trees, so combining their votes reduces variance more than bagging alone.
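
This can be made concrete with `randomForest()` itself: setting `mtry` equal to the number of predictors turns the forest into a bagged ensemble, while a smaller `mtry` gives a random forest. A minimal sketch of that comparison (the object names below are illustrative, not part of the assignment):

```{r q4-bagging-vs-rf, eval=FALSE}
p <- ncol(training) - 1  # number of predictors (every column except Competitive)

# Bagging as a special case: every split may consider all p predictors
bag_via_rf <- randomForest(Competitive ~ ., data = training, mtry = p)

# Random forest: each split samples only 4 candidate predictors,
# which decorrelates the trees before their votes are combined
rf_mtry4 <- randomForest(Competitive ~ ., data = training, mtry = 4)

# Compare the out-of-bag error estimates printed for the two ensembles
bag_via_rf
rf_mtry4
```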