- 19th Oct 2021
- 06:03 am
Theoretical Understanding of Data Analytics Concepts
This assignment is meant to test your theoretical understanding of the concepts we have learned in class. This will also give you an understanding of the type of theoretical questions that you may come across in the Final Exam.
Please submit your answers in a Word document. TurnItIn will be enabled.
Task: (10 Points Total) (0.909 Points per Sub-Task)
Instructions: Answer the following questions briefly. You can use the slides and external resources to look up the answers; however, please write the answers in your own words. If you use an external resource to find the answer to a question, please cite the resource.
Sub-Task A) What are Ensemble methods?
Ensemble methods combine two or more classifiers or predictors to improve the accuracy of the overall model, e.g. Random Forest.
In ensemble methods, we train several models and combine them into one model that usually performs better than any of the individual models. We may fuse the outputs of all the models (for example, by voting or averaging) or select one of them to make the decision, depending on the method.
Sub-Task B) Are Decision Trees a type of Ensemble Method? Briefly justify your answer.
A single decision tree is not an Ensemble method because it has no base classifiers and no combiner. The whole decision process comes from one model fitted by one algorithm, so there are no lower-level decisions to combine.
Sub-Task C) Are SVMs a type of Ensemble Method? Briefly justify your answer.
A single SVM is not an Ensemble method. It does not combine the decisions of multiple lower-level classifiers to make one final decision. There are no “weak learners” in the process.
Sub-Task D) Are Random Forests a type of Ensemble Method? Briefly justify your answer.
Yes. Random forests are Ensemble methods. They contain lower-level classifiers (decision trees) and a rule for combining them (usually a majority vote for classification), and the final decision is based on the decisions of these smaller entities.
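The combination step can be illustrated with a minimal sketch. It is written in plain Python for illustration only (it is not how any particular random forest library is implemented), and the five tree votes are made-up values.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the base classifiers' predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical votes from five decision trees for one employee.
tree_votes = ["Yes", "No", "Yes", "Yes", "No"]

# The forest's final class is the label most trees agree on.
forest_prediction = majority_vote(tree_votes)

# The share of trees voting "Yes" can serve as a rough probability estimate.
yes_fraction = tree_votes.count("Yes") / len(tree_votes)
```

A single decision tree or a single SVM has no such list of votes to combine, which is why neither counts as an ensemble.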
Sub-Task E) Describe one difference between Bagging and Boosting Techniques.
In Bagging (Bootstrap Aggregating), we train multiple copies of the same model on different samples of the data, generally bootstrapped, and make a final decision by combining all the models. The combination is mostly done by majority vote or averaging.
In Boosting, we build the models sequentially, giving higher weights to the data points that earlier models misclassified. The models themselves are weighted by their accuracy or another measure of performance.
The most important difference is that in Bagging the base (weak) models do not depend on each other, whereas in Boosting each base model depends on the performance of the base models trained before it.
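The difference in how the training data is prepared can be sketched as follows. This is an illustrative Python sketch, not a full implementation: the ten data points and the set of misclassified points are hypothetical, and the weight update follows the AdaBoost rule as one common example of boosting.

```python
import math
import random

rng = random.Random(0)
data = list(range(10))  # indices of ten hypothetical training points

# Bagging: each base model gets its own bootstrap sample (drawn with
# replacement), independently of every other base model.
bootstrap_samples = [rng.choices(data, k=len(data)) for _ in range(3)]

# Boosting (AdaBoost-style): weights start uniform, and after each round the
# points the current model got wrong are up-weighted for the next model.
weights = [1 / len(data)] * len(data)
misclassified = {2, 7}  # hypothetical errors made by the first base model

error = sum(weights[i] for i in misclassified)   # weighted error rate
alpha = 0.5 * math.log((1 - error) / error)      # vote weight of this model

weights = [w * math.exp(alpha if i in misclassified else -alpha)
           for i, w in enumerate(weights)]
total = sum(weights)
weights = [w / total for w in weights]           # renormalise to sum to 1
```

The bootstrap samples could all be drawn in parallel, while the boosting weights for the second model cannot be computed until the first model's errors are known.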
Sub-Task F) Describe two advantages and two limitations of Random Forests.
Advantages of random forest:
- It can produce a strong model at relatively low computational cost, and the concept is simple to understand.
- Each tree is trained on a bootstrapped sample of the data, so there is less need for a very large dataset.
Disadvantages of random forest:
- Its statistical properties are less thoroughly studied than those of classical models, and it has no very reliable method for testing the significance of variables.
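- It is much harder to interpret than a single decision tree. Averaging many trees reduces overfitting compared with one tree, but the forest can still overfit noisy data when the individual trees are grown very deep.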
Sub-Task G) Given that you are working on the "Attrition" data set and want to predict the probability of Attrition, which algorithms/models will be appropriate from the following:
- Random Forests
- Decision Trees
- Generalized Linear Models
- Support Vector Machines (SVMs)
- Linear Regression
Briefly justify your answer.
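Predicting the probability of Attrition is not a direct classification problem, because we need a probability rather than just a class label. The most appropriate method would be a Generalized Linear Model (such as logistic regression).
The reason is that it rests on sound theory and models the probability of the event rather than the class, which makes it well suited to this task. Generalized linear models with a logit link are designed for exactly this kind of problem. Linear Regression is not appropriate because its predictions are not restricted to the range 0 to 1. Random Forests, Decision Trees and SVMs are primarily classifiers; they can report probability estimates, but these are a by-product rather than the quantity the model is built around.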
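A small sketch of why the logit link matters, in plain Python for illustration. The three scores are hypothetical values of the linear predictor (intercept plus coefficients times features), not results from the Attrition data.

```python
import math

def logistic(z):
    """Inverse logit link: maps any real-valued score into the interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical linear predictor values for three employees.
scores = [-3.0, 0.0, 2.5]

# Logistic regression reports the transformed scores, which are valid probabilities.
probabilities = [logistic(z) for z in scores]

# Plain linear regression would report the raw scores themselves,
# and some of them fall outside the range 0 to 1.
invalid_as_probability = [z for z in scores if not 0 <= z <= 1]
```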
Sub-Task H) In the svm() function from the "e1071" package, name and briefly describe two hyper-parameters that we can optimize to fit the SVM model better to our data set.
kernel: This parameter decides which kernel to fit, such as linear, polynomial, sigmoid or radial basis.
class.weights: This parameter can be used to give a weight to each class. It may be particularly helpful for unbalanced datasets.
Sub-Task I) Do we use the testing data set or the validation data set for hyper-parameter optimization?
We use the validation data set for hyper-parameter optimization.
The test data is used only to measure the final model's performance. It should never be used to fit or tune the model, and it should remain unseen in all computations that build the model.
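The roles of the three data sets can be sketched as below. This is an illustrative Python sketch: the 60/20/20 split is an assumption, and the validation accuracies per cost value are made-up numbers used only to show where the choice is made.

```python
import random

rng = random.Random(42)
indices = list(range(100))   # 100 hypothetical rows
rng.shuffle(indices)

# Assumed 60/20/20 split into training, validation and test rows.
train = indices[:60]
validation = indices[60:80]
test = indices[80:]

# Made-up validation accuracies for three candidate cost values.
# Each candidate is fitted on `train` and scored on `validation` only.
validation_accuracy = {1: 0.81, 10: 0.86, 100: 0.84}
best_cost = max(validation_accuracy, key=validation_accuracy.get)

# Only after best_cost is fixed is the final model scored once on `test`.
```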
Sub-Task J) When using grid search, calculate the total number of models that we need to build for the following choice of hyper-parameters in an SVM model:
3 Kernels: Polynomial, Radial, Sigmoid
3 Cost Values: 1, 10, 100
5 Gamma Values: 0.1, 1, 10, 100, 1000
5 Degree Values: 0.1, 1, 10, 100, 1000
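The total number of models we’ll build is: 2*3*5*1 + 1*3*5*5 = 30 + 75 = 105 models.
The Degree values are used only for the Polynomial kernel. The Radial and Sigmoid kernels therefore each need 3 Cost values * 5 Gamma values = 15 models, while the Polynomial kernel needs 3 * 5 * 5 = 75 models.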
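The count can be checked by enumerating the grid. The sketch below is plain Python for illustration (it does not call e1071), and it assumes that Degree is tuned only for the polynomial kernel while Cost and Gamma are tuned for all three kernels.

```python
from itertools import product

costs = [1, 10, 100]
gammas = [0.1, 1, 10, 100, 1000]
degrees = [0.1, 1, 10, 100, 1000]

grid = []
for kernel in ["polynomial", "radial", "sigmoid"]:
    # Assumption: degree only affects the polynomial kernel, so the other
    # kernels get a single placeholder instead of five degree values.
    kernel_degrees = degrees if kernel == "polynomial" else [None]
    grid.extend(product([kernel], costs, gammas, kernel_degrees))

n_models = len(grid)  # 75 polynomial + 15 radial + 15 sigmoid = 105

# A naive full grid that varies degree for every kernel would be larger.
n_full_grid = 3 * len(costs) * len(gammas) * len(degrees)  # 225
```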
Sub-Task K) If we must validate 3 different models using a 20-fold Cross Validation, how many times in total will we train all 3 models?
Each of the three models would be trained exactly 20 times.
In 20-fold cross validation, we divide the dataset into 20 parts. We then train the model 20 times, each time leaving out one part of the data and measuring performance on that held-out part, which is out of sample. The process ends once every one of the 20 parts has been held out once.
So each model is trained 20 times, and the total number of training runs would be 3 * 20 = 60.
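The count can be confirmed with a short sketch in plain Python. The helper `k_fold_splits`, the model names and the row count of 200 are hypothetical and chosen only for illustration; the sketch counts training runs and does not fit any real model.

```python
def k_fold_splits(n_rows, k):
    """Yield (train_indices, holdout_indices) for each of the k folds."""
    indices = list(range(n_rows))
    for fold in range(k):
        holdout = indices[fold::k]          # every k-th row, offset by fold
        holdout_set = set(holdout)
        train = [i for i in indices if i not in holdout_set]
        yield train, holdout

models = ["model_1", "model_2", "model_3"]  # hypothetical model names

fits = 0
for _ in models:
    for train, holdout in k_fold_splits(n_rows=200, k=20):
        fits += 1   # one training run per model per fold
```

Here `fits` ends at 60, matching 3 models times 20 folds.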