```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(warn = -1)
```

## `Question 1`: Run a classification tree, using the default controls of rpart(). Looking at the validation set, what is the overall accuracy? What is the lift on the first decile?
```{r q1}
library(rpart)  # classification trees
library(caret)  # confusionMatrix()
library(lift)   # TopDecileLift()

# Load the data and convert the categorical predictors to factors
eBayAuctions <- read.csv("eBayAuctions.csv")
eBayAuctions$Category <- as.factor(eBayAuctions$Category)
eBayAuctions$currency <- as.factor(eBayAuctions$currency)
eBayAuctions$endDay <- as.factor(eBayAuctions$endDay)

# Rename the outcome column (column 8) and convert it to a factor
colnames(eBayAuctions)[8] <- "Competitive"
eBayAuctions$Competitive <- as.factor(eBayAuctions$Competitive)

# Random 60/40 split into training and validation sets
n <- nrow(eBayAuctions)
train_samp <- sample(1:n, floor(0.6 * n), replace = FALSE)
training <- eBayAuctions[train_samp, ]
validation <- eBayAuctions[-train_samp, ]

# Fit a classification tree with the default rpart() controls
model.tree <- rpart(Competitive ~ ., data = training, method = "class")
pred <- predict(model.tree, validation, type = "class")
confusionMatrix(pred, validation$Competitive)
# Lift on the first decile of the validation set
TopDecileLift(pred, validation$Competitive)
```

The overall accuracy of the default tree on the validation set is 85%. The lift on the first decile is 1.668.

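
To make the first-decile lift concrete, here is a minimal base-R sketch of one common definition: rank the validation records by a predicted score, take the top 10%, and divide their positive rate by the overall positive rate. The helper `first_decile_lift()` and the toy vectors are hypothetical and not part of the assignment; the `lift` package may handle ties and rounding differently. With `rpart`, a score of this kind could be taken from `predict(model.tree, validation, type = "prob")`.

```r
# Hypothetical helper (not from the lift package): lift in the top decile.
first_decile_lift <- function(score, actual) {
  # score: numeric, higher means more likely positive; actual: 0/1 vector
  n_top <- ceiling(length(score) / 10)
  top <- order(score, decreasing = TRUE)[seq_len(n_top)]
  # Positive rate among the top-ranked records relative to the overall rate
  mean(actual[top]) / mean(actual)
}

# Toy data, made up for illustration (not taken from eBayAuctions.csv)
score  <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10,
            0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05, 0.01)
actual <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0,
            1, 0, 1, 0, 0, 0, 0, 0, 0, 0)

first_decile_lift(score, actual)
```

Ranking by a probability rather than by the predicted class avoids the many ties that hard 0/1 predictions produce, so the two approaches can give different lift values.
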
## `Question 2`: Run a boosted tree with the same predictors (use function boosting() in the adabag package). For the validation set, what is the overall accuracy? What is the lift on the first decile?
```{r q2}
library(adabag)  # boosting() and bagging()

# Boosted tree with 10 iterations
boost_model <- boosting(Competitive ~ ., data = training, mfinal = 10, boos = TRUE)
pred2 <- predict(boost_model, newdata = validation, type = "class")
confusionMatrix(as.factor(pred2$class), validation$Competitive)
TopDecileLift(as.factor(pred2$class), validation$Competitive)
```

The overall accuracy of the boosted model on the validation set is 0.8796. The lift on the first decile is 1.761.

## `Question 3`: Run a bagged tree with the same predictors (use function bagging() in the adabag package). For the validation set, what is the overall accuracy? What is the lift on the first decile?
```{r q3}
# Bagged tree with 10 bootstrap replicates (bagging() comes from adabag)
model_bag <- bagging(Competitive ~ ., data = training, mfinal = 10)
pred_bag <- predict(model_bag, validation, type = "class")
confusionMatrix(as.factor(pred_bag$class), validation$Competitive)
TopDecileLift(as.factor(pred_bag$class), validation$Competitive)
```

The overall accuracy of the bagged model on the validation set is 0.8631. The lift on the first decile is 1.807.

## `Question 4`: Run a random forest (use function randomForest() in package randomForest with argument mtry = 4). Compare the bagged tree to the random forest in terms of validation accuracy and lift on first decile. How are the two methods conceptually different?
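```{r q4}
library(randomForest)

# Random forest that considers 4 randomly chosen predictors at each split
model_rf <- randomForest(Competitive ~ ., data = training, mtry = 4)
pred_rf <- predict(model_rf, validation)
confusionMatrix(as.factor(pred_rf), validation$Competitive)
TopDecileLift(as.factor(pred_rf), validation$Competitive)
```

The overall accuracy of the random forest on the validation set is 0.8847. The lift on the first decile is 1.738. Compared with the bagged tree (accuracy 0.8631, lift 1.807), the random forest is more accurate but has a lower first-decile lift.
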
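The fundamental difference lies in which predictors are considered at each split. In a random forest, only a random subset of the predictors (here 4, set by `mtry`) is drawn at each node, and the best split among that subset is used. In bagging, all predictors are considered at every node. Both methods grow each tree on a bootstrap sample of the training data, so the random subset of predictors is what makes the trees in a random forest less correlated with one another.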
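
The difference in split candidates can be sketched in a few lines of base R. This is an illustration only: `candidate_predictors()` and the predictor names `x1` to `x7` are hypothetical, and real implementations repeat this draw at every node of every tree.

```r
# Hypothetical helper: which predictors are candidates at a single split?
candidate_predictors <- function(predictors, mtry = length(predictors)) {
  # With mtry equal to the number of predictors, every predictor is a
  # candidate, which is how bagging behaves.
  sample(predictors, mtry)
}

# Placeholder predictor names (assumed, not the real column names)
predictors <- paste0("x", 1:7)

set.seed(1)
bagging_candidates <- candidate_predictors(predictors)            # all 7
forest_candidates  <- candidate_predictors(predictors, mtry = 4)  # random 4

length(bagging_candidates)
length(forest_candidates)
```

By the same reasoning, calling `randomForest()` with `mtry` set to the total number of predictors makes it consider every predictor at each split, as bagging does.
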