---
title: "Linear Regression"
output:
  pdf_document: default
  html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Overview
The dataset appears to come from a hospitality chain: respondents rate their hotel stays on parameters such as cleanliness, staff, location, and room service.
## Importing libraries
```{r}
library(readr)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(lattice)
library(DataExplorer)
library(grDevices)
library(caret)
library(Metrics)
```
```{r}
df <- read.csv("study1data.csv")
dim(df) ## checking the dimensions
```
```{r}
any(is.na(df)) ## check for missing values
```
```{r}
str(df)
```
```{r}
summary(df)
```
## Exploratory Data Analysis
```{r}
plot_intro(df)
```
## Histograms of the variables
```{r}
plot_histogram(df)
```
## Dropping redundant variables
```{r}
## Drop columns that will not be used as predictors.
df$Review_Date <- NULL
df$Day_of_visit_tentative <- NULL
```
```{r}
model <- lm(Review_Overall_Rating ~ ., data = df)
print(model)
```
One thing worth considering is correlated predictors. Including several highly correlated features adds little explanatory power beyond including just one of them, but it inflates the variance of the coefficient estimates. As a first pass, let's find the highly correlated predictors:
```{r}
correlations <- cor(df[,1:7])
print(correlations)
```
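Correlation only looks at pairs of predictors; variance inflation factors (VIFs) summarise how much collinearity inflates each coefficient's variance. A quick sketch, assuming the car package is installed (it is not loaded above); values above roughly 5 are a common warning sign:
```{r}
## VIFs for the full model; car::vif() reports generalized VIFs for
## factors such as VisitType. Assumes the car package is installed.
library(car)
vif(lm(Review_Overall_Rating ~ ., data = df))
```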
```{r}
highCorrelations <- findCorrelation(correlations, cutoff = .6, verbose = TRUE)
```
```{r}
print(highCorrelations)
```
Read the output from the findCorrelation() call carefully: it wants to eliminate columns 1, 4, 6, 5, and 2. It flags column 3 because it is correlated with column 5, and column 5 because it is correlated with column 8; but if we remove column 5, column 3 will no longer be strongly correlated with anything. It is important to watch what your code is doing; in this case, column 4 does not look like it needs to be removed.
```{r}
highCorrelations <- highCorrelations[-1] ## drop the first flagged column from the elimination list
print(highCorrelations)
```
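To verify that, drop the remaining flagged columns and check that nothing in the reduced correlation matrix still exceeds the cutoff:
```{r}
## Sanity check: after removing the flagged columns, no remaining pair
## of predictors should be correlated above the 0.6 cutoff.
reduced <- correlations[-highCorrelations, -highCorrelations]
any(abs(reduced[upper.tri(reduced)]) > 0.6)
```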
## ANOVA test on categorical variables
```{r}
fit <- lm(formula = Review_Overall_Rating ~ VisitType, data = df)
b <- anova(fit)
b
```
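If the ANOVA is significant, a post-hoc test shows which visit types actually differ from each other. A minimal sketch using base R's TukeyHSD(), which needs an aov fit rather than an lm fit:
```{r}
## Pairwise comparisons of mean overall rating between visit types.
aov_fit <- aov(Review_Overall_Rating ~ VisitType, data = df)
TukeyHSD(aov_fit)
```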
## Stepwise regression
```{r}
fitt <- step(lm(Review_Overall_Rating ~ Rating_Value
                + Rating_Location
                + Rating_Sleep_Quality
                + Rating_Rooms
                + Rating_Cleanliness
                + Rating_Service
                + VisitType,
                data = df), direction = "both")
summary(fitt)
```
## Splitting the data into training and testing sets
```{r}
set.seed(123) ## arbitrary seed so the random split is reproducible
train_ind <- sample(1:nrow(df), size = floor(0.70 * nrow(df))) ## 70% of rows for training
```
```{r}
training <- df[train_ind, ]
testing  <- df[-train_ind, ]
```
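Since caret is already loaded, a stratified split with createDataPartition() is an alternative worth knowing; it keeps the distribution of the outcome similar in both sets (idx, training2, and testing2 are illustrative names):
```{r}
## Stratified 70/30 split on the outcome using caret.
idx <- createDataPartition(df$Review_Overall_Rating, p = 0.70, list = FALSE)
training2 <- df[idx, ]
testing2  <- df[-idx, ]
```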
## Building a model on the training dataset
```{r}
fit <- lm(Review_Overall_Rating ~ Rating_Value
          + Rating_Location
          + Rating_Sleep_Quality
          + Rating_Rooms
          + Rating_Cleanliness
          + Rating_Service, data = training)
summary(fit)
```
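Before reading too much into the coefficients, it is worth checking the standard regression diagnostics. Base R's plot() method for lm objects draws residuals vs. fitted, normal Q-Q, scale-location, and leverage plots:
```{r}
## Diagnostic plots for the training-set model.
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
```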
```{r}
library(MASS) ## note: MASS masks dplyr::select()
step3 <- stepAIC(fit, direction = "both")
```
```{r}
ls(step3)   ## components of the fitted model object
step3$anova ## record of the steps taken by stepAIC
```
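The held-out test set has not been used yet. A minimal evaluation sketch, assuming the stepwise model step3 is the final choice; rmse() and mae() come from the Metrics package loaded at the top:
```{r}
## Evaluate the stepwise model on the held-out 30% of the data.
preds <- predict(step3, newdata = testing)
rmse(testing$Review_Overall_Rating, preds)
mae(testing$Review_Overall_Rating, preds)
```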
```{r}
ggplot(df) +
  aes(x = Review_Overall_Rating, fill = VisitType) +
  geom_histogram(bins = 30L) +
  scale_fill_hue() +
  theme_minimal()
```
```{r}
ggplot(df) +
  aes(x = VisitType) +
  geom_bar(fill = "#d8576b") +
  theme_classic()
```
```{r}
ggplot(df) +
  aes(x = Rating_Service, fill = VisitType) +
  geom_histogram(bins = 30L) +
  scale_fill_brewer(palette = "RdBu") +
  theme_classic()
```