- 20th Oct 2022
- 06:03 am
R Programming Assignment Solution - R Program on Respiratory Infections
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
$\underline{\textbf{Description of Dataset:}}$
In our project, we are using the "Exasens" dataset from the UCI Machine Learning Repository. The Exasens dataset includes demographic information on $4$ groups of saliva samples (COPD-Asthma-Infected-HC) collected in the frame of a joint research project, Exasens, at the Research Center Borstel, BioMaterialBank Nord (Borstel, Germany). The $4$ sample groups included within the Exasens dataset are defined as:
(I) Hospitalised patients and outpatients with COPD who do not have an acute respiratory infection (COPD).
(II) Hospitalised patients and outpatients with asthma who do not have an acute respiratory infection (Asthma).
(III) Patients with respiratory infections who have neither COPD nor asthma (Infected).
(IV) Healthy controls without COPD, asthma, or any respiratory infection (HC).
The dataset contains $6$ attributes, namely,
Diagnosis (COPD-HC-Asthma-Infected) - tells us which of the four sample groups the patient belongs to
ID - gives a unique identity index for each patient
Age - provides the patient's age
Gender - denoted by $1$ for male and $0$ for female
Smoking Status - denoted by $1$ for non-smoker, $2$ for ex-smoker and $3$ for active-smoker
Saliva Permittivity - a measurement with an imaginary and a real part; for each part, the minimum and the average of the observations are given as Min. and Avg. respectively.
$\underline{\textbf{Summarizing the Dataset:}}$
Next, we are going to view the top of the data matrix and see what it looks like.
```{r}
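# stringsAsFactors = TRUE reads text columns as factors (the default before R 4.0.0)
data = read.csv("Exasens.csv", stringsAsFactors = TRUE)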
head(data)
```
We note that the columns $X.2,X.3,X.4,X.5$ contain only filler space and the coding notation that we have already described above, so we remove them from our data frame and check the data types of the remaining variables using the str() command.
```{r}
data = subset(data, select = -c(X.2,X.3,X.4,X.5))
str(data)
```
which reveals that the values for salivary permittivity are factor variables, while the $\texttt{Gender, Age, Smoking}$ variables are integers. We can view an overall summary of our data as:
```{r}
summary(data)
```
which shows us the frequency of each level of the "Diagnosis" variable and shows $2$ NA values in each of $\texttt{Gender, Age, Smoking}$, which are due to the top two redundant rows. Since $\texttt{Gender}$ and $\texttt{Smoking}$ are categorical variables, it does not make sense to look at their quartiles; the quartiles are shown only because the codes are interpreted as integer values. For $\texttt{Age}$, which is genuinely numeric, the quartiles are meaningful.
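The difference between summarising a coded variable as an integer and as a factor can be seen on a small vector. This is only a sketch: `toy_smoking` and its values are made up for illustration and are not taken from the Exasens data.

```{r}
# Hypothetical smoking codes (1 = non-smoker, 2 = ex-smoker, 3 = active smoker)
toy_smoking = c(1, 2, 2, 3, 1, 2)

# As numbers: quartiles and a mean, which say little about category codes
summary(toy_smoking)

# As a factor: one count per category, which is what we actually want
toy_counts = summary(as.factor(toy_smoking))
toy_counts
```

This is the reason the cleaning step later converts the coded columns with as.factor().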
$\underline{\textbf{Cleaning the Dataset:}}$
Now, we take the following steps in cleaning of the data:
(1) We remove the first two rows.
(2) Looking at the whole data now, we see that there are lots of rows whose $\texttt{Imaginary.Part}, X, \texttt{Real.Part}, X.1$ data are missing. There are $399$ rows in the data, and $299$ of them have $\texttt{Imaginary.Part}, X, \texttt{Real.Part}, X.1$ missing. So, we decide to delete the columns $\texttt{Imaginary.Part}, X, \texttt{Real.Part}, X.1$.
We do not lose much this way, in a statistical sense, for two reasons. First, our variable of interest is $\texttt{Diagnosis}$, and we assume that the diagnosis already reflects medical measurements such as those encoded in $\texttt{Imaginary.Part}, X, \texttt{Real.Part}, X.1$. Second, these variables have a huge amount of missing data.
(3) We transform the data types of the variables $\texttt{Gender}$ and $\texttt{Smoking}$ to type factor, since they are categorical variables.
```{r}
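# (1) Drop the two redundant rows at the top
data = data[-c(1,2),]
# (2) Drop the saliva permittivity columns, which are missing for most rows
data = subset(data, select = -c(Imaginary.Part, X, Real.Part, X.1))
# (3) Treat the coded variables Gender and Smoking as categorical
data$Gender = as.factor(data$Gender)
data$Smoking = as.factor(data$Smoking)
head(data)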
```
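The count of missing entries that motivates step (2) can be obtained per column rather than by eye. The following is a minimal sketch on a made-up data frame; `toy_missing` and its values are hypothetical, and it assumes that missing entries appear either as NA or as empty strings.

```{r}
# Toy data frame (made-up values); blank strings stand in for missing entries
toy_missing = data.frame(
  Diagnosis = c("COPD", "HC", "Asthma", "Infected"),
  Real.Part = c("-320.6", "", "", "-311.2"),
  Age       = c(77, 25, 41, 38)
)

# Treat both NA and empty strings as missing, then count them per column
n_missing = colSums(is.na(toy_missing) | toy_missing == "")
n_missing

# Share of rows that are missing in each column
n_missing / nrow(toy_missing)
```

A column whose share is close to $1$ carries little usable information and is a candidate for removal.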
$\underline{\textbf{Choosing Variables to Analyze:}}$
Now, we have $\texttt{Diagnosis, ID, Gender, Age, Smoking}$ from which we are going to analyze the association of $\texttt{Gender, Age, Smoking}$ with the $\texttt{Diagnosis}$.
The $\texttt{ID}$ is only a unique identifier for the patients and carries no more information than the row number, so we are not going to use that variable.
$\underline{\textbf{Histograms and Barplots:}}$
For the barplots, we will need the counts of COPD, Asthma, HC, Infected classified according to Gender ($1$ or $0$) and according to the nature of Smoking history ($1$, $2$ or $3$). For that, we construct the following contingency tables.
```{r}
dfgender = data.frame(Diagnosis = data$Diagnosis, Gender = data$Gender)
ctgender = table(dfgender)
ctgender
dfsmoking = data.frame(Diagnosis = data$Diagnosis, SmokingHistory = data$Smoking)
ctsmoking = table(dfsmoking)
ctsmoking
```
Now we use the counts from these tables to make our barplot for diagnosis coupled with gender ...
```{r}
diagngender = c(55,23,104,58,25,56,56,22)
diagn <- c("Asthma","COPD","HC","Infected")
barplot(matrix(diagngender, ncol=4, byrow=TRUE), names.arg = diagn)
```
... and for diagnosis coupled with smoking habit
```{r}
diagnsmoking = c(36,6,93,44,35,63,35,17,9,10,32,19)
barplot(matrix(diagnsmoking, ncol=4, byrow=TRUE), names.arg = diagn)
```
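Typing the counts by hand works, but the matrix that barplot() needs can also be taken directly from a contingency table. The sketch below uses a made-up data frame (`toy_gender` is hypothetical) with the same column names as the cleaned data; the transpose puts one diagnosis in each bar.

```{r}
# Toy records (made-up values) with the same column names as the cleaned data
toy_gender = data.frame(
  Diagnosis = c("Asthma", "COPD", "COPD", "HC", "HC", "HC", "Infected"),
  Gender    = factor(c(0, 1, 1, 0, 0, 1, 0))
)

# table() gives rows = Diagnosis and columns = Gender;
# t() swaps them so that each bar is a diagnosis and each segment a gender
toy_counts_gender = t(table(toy_gender$Diagnosis, toy_gender$Gender))
toy_counts_gender

barplot(toy_counts_gender, xlab = "Diagnosis", ylab = "Counts")
```

With this approach the bar labels come from the column names of the table, so names.arg is not needed and the plot stays in step with the data.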
We can visualize the distribution of the variable $\texttt{Age}$ in the form of a histogram.
```{r}
hist(data$Age)
```
$\underline{\textbf{Making the above plots more readable:}}$
In this part, we add graph titles, legends, axis labels, density curves fitted to the histograms, and colours, so that the graphs above are easier to read.
```{r}
barplot(matrix(diagngender, ncol=4, byrow=TRUE), names.arg = diagn, xlab = "Diagnosis", ylab = "Counts", main = "Diagnosis vs gender", col = c("red", "yellow"))
legend("topleft", legend = c("Female", "Male"), cex=1.3, fill=c("red", "yellow"))
barplot(matrix(diagnsmoking, ncol=4, byrow=TRUE), names.arg = diagn, xlab = "Diagnosis", ylab = "Counts", main = "Diagnosis vs smoking habit", col = c("green", "blue", "brown"))
legend("topleft", legend = c("Non-Smoker", "Ex-Smoker", "Active Smoker"), cex=1.3, fill=c("green", "blue", "brown"))
# freq = FALSE puts the histogram on the density scale, so the curve added by lines() shares its axes
hist(data$Age, freq = FALSE, main = "Distribution of Age among the patients", xlab = "Age", ylab = "Density", border = "red", col = c("violet", "yellow"))
lines(density(data$Age))
```
$\underline{\textbf{Graphs to Explore Relations Among Variables:}}$
We partition the data according to the diagnoses to look at how the variables $\texttt{Age, Gender, Smoking}$ relate to the individual medical conditions.
```{r}
dfasthma = data[data$Diagnosis == "Asthma",]
dfcopd = data[data$Diagnosis == "COPD",]
dfhc = data[data$Diagnosis == "HC",]
dfinfected = data[data$Diagnosis == "Infected",]
```
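The same partition can be produced in one call with split(), which returns a named list holding one data frame per diagnosis. This is a sketch on made-up values; `toy_split` and `by_diagnosis` are hypothetical names.

```{r}
# Toy records (made-up values)
toy_split = data.frame(
  Diagnosis = c("Asthma", "COPD", "HC", "HC", "Infected"),
  Age       = c(45, 70, 25, 31, 52)
)

# One list element per distinct diagnosis, named after the diagnosis
by_diagnosis = split(toy_split, toy_split$Diagnosis)
names(by_diagnosis)

# Each element is an ordinary data frame
by_diagnosis$HC
```

A list like this is convenient when the same plot or summary has to be repeated for every group.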
First, we look at $\texttt{Age}$.
```{r}
# freq = FALSE puts each histogram on the density scale, so the curve added by lines() shares its axes
hist(dfasthma$Age, freq = FALSE, main = "Distribution of Asthma affected people according to age", xlab = "Age", ylab = "Density", border = "red", col = c("violet", "yellow"))
lines(density(dfasthma$Age))
hist(dfcopd$Age, freq = FALSE, main = "Distribution of COPD affected people according to age", xlab = "Age", ylab = "Density", border = "black", col = c("violet", "yellow"))
lines(density(dfcopd$Age))
hist(dfhc$Age, freq = FALSE, main = "Distribution of healthy people according to age", xlab = "Age", ylab = "Density", border = "green", col = c("violet", "yellow"))
lines(density(dfhc$Age))
hist(dfinfected$Age, freq = FALSE, main = "Distribution of Infected people according to age", xlab = "Age", ylab = "Density", border = "Chocolate2", col = c("violet", "yellow"))
lines(density(dfinfected$Age))
```
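The histograms can be complemented with a numeric summary of age per diagnosis. The sketch below uses tapply() on made-up values; `toy_age` and `median_age` are hypothetical names, and the numbers are not the Exasens results.

```{r}
# Toy records (made-up values)
toy_age = data.frame(
  Diagnosis = c("Asthma", "Asthma", "COPD", "COPD", "HC", "HC"),
  Age       = c(40, 60, 66, 74, 22, 30)
)

# Apply median() to Age separately within each diagnosis
median_age = tapply(toy_age$Age, toy_age$Diagnosis, median)
median_age
```

Replacing median with another function, such as range or quantile, gives other per-group summaries in the same way.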
If we want to look at gender-specific or smoking-specific patterns within the individual diagnoses, these are given by the bar charts in the previous section.
$\underline{\textbf{Conclusion:}}$
From the age histograms for each diagnosis, we can see that the majority of the people affected by asthma and COPD are in the $60$ to $80$ years age bracket, while the participants marked HC, i.e. the healthy controls, are mostly younger, in the $20$ to $30$ years age bracket.
From the bar chart on gender, we see that females outnumber males among the asthma patients ($55$ vs $25$), the healthy controls ($104$ vs $56$) and the patients infected by diseases other than asthma and COPD ($58$ vs $22$), while males outnumber females among the COPD patients ($56$ vs $23$).
From the bar chart on smoking habits, we see that a majority of the healthy controls ($93$ of $160$) and of those infected by something other than asthma and COPD ($44$ of $80$) are non-smokers, while most of the patients with COPD ($63$ of $79$, about $80\%$) are ex-smokers. Since COPD is also the only diagnosis in which males outnumber females, this suggests that many of the COPD patients are male ex-smokers, although confirming it would require cross-tabulating $\texttt{Gender}$ and $\texttt{Smoking}$ within the COPD group.
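The comparisons above are easier to judge as within-diagnosis shares than as raw counts, because the four groups differ in size. The sketch below applies prop.table() to the counts typed into the barplot code earlier; the object names are hypothetical. The second part shows, on made-up records (`toy_joint`), how a claim about gender and smoking jointly could be checked with a three-way table.

```{r}
groups = c("Asthma", "COPD", "HC", "Infected")

# Counts copied from the barplot code earlier in this document
gender_counts = matrix(c(55, 23, 104, 58, 25, 56, 56, 22), ncol = 4, byrow = TRUE,
                       dimnames = list(Gender = c("Female", "Male"), Diagnosis = groups))
smoking_counts = matrix(c(36, 6, 93, 44, 35, 63, 35, 17, 9, 10, 32, 19), ncol = 4, byrow = TRUE,
                        dimnames = list(Smoking = c("Non-smoker", "Ex-smoker", "Active smoker"),
                                        Diagnosis = groups))

# margin = 2 divides by the column totals, so each diagnosis sums to 1
gender_share = prop.table(gender_counts, margin = 2)
smoking_share = prop.table(smoking_counts, margin = 2)
round(gender_share, 2)
round(smoking_share, 2)

# Toy records (made-up values) for a joint table of all three variables
toy_joint = data.frame(
  Diagnosis = c("COPD", "COPD", "COPD", "HC", "HC"),
  Gender    = c(1, 1, 0, 0, 1),
  Smoking   = c(2, 2, 2, 1, 1)
)
joint = xtabs(~ Diagnosis + Gender + Smoking, data = toy_joint)
ftable(joint)
```

On the cleaned data frame, the same xtabs() formula with data = data would give the number of male ex-smokers among the COPD patients directly.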