Machine Learning with Python (2021 Fall semester)
Programming Assignment: Classification of Titanic Data Set
- Benchmark Dataset: The task is to predict survival based on information about the
people on board the Titanic. You should evaluate the performance of each of the machine
learning models specified in this assignment. You can download the dataset from the following
website: https://www.kaggle.com/c/titanic
In this assignment, both model training and testing use the train.csv file.
When performing the task, be careful NOT to use the following features for model training:
PassengerId, Name, Ticket, Cabin
- Preprocessing
- Some rows in the train data have missing values. Remove these rows.
- Split the samples in train.csv 7:3 into training data and test data.
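The preprocessing steps above could be sketched roughly as follows. This is only a starting point under several assumptions: the helper name prepare_titanic, the seed, and the one-hot encoding of Sex and Embarked are not specified by the assignment, and the excluded columns are dropped before the missing-value filter so that empty Cabin entries do not remove rows (whether that order is intended is also an assumption).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Features the assignment says NOT to use for training.
EXCLUDED = ["PassengerId", "Name", "Ticket", "Cabin"]


def prepare_titanic(df, seed=0):
    """Return X_train, X_test, y_train, y_test from a train.csv-style frame.

    `seed` and the one-hot encoding are illustrative assumptions.
    """
    # Drop the excluded columns first, then remove rows with missing values.
    df = df.drop(columns=EXCLUDED, errors="ignore").dropna()
    y = df["Survived"]
    # Encode text columns (Sex, Embarked) as numeric indicator columns.
    X = pd.get_dummies(df.drop(columns="Survived"), drop_first=True, dtype=float)
    # 7:3 split into training data and test data.
    return train_test_split(X, y, test_size=0.3, random_state=seed)


# Usage, assuming train.csv is in the same directory as the script:
# X_train, X_test, y_train, y_test = prepare_titanic(pd.read_csv("train.csv"))
```

Fixing the seed keeps the split reproducible, which helps the reported numbers match the numbers printed on execution.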
- Machine Learning Models: Use scikit-learn to implement the following three machine learning
models and evaluate their performance.
3-1 K-Nearest Neighbors (KNN) (sklearn.neighbors.KNeighborsClassifier): Analyze how the results
on the test data change as the number of neighbors K (n_neighbors) is varied over [3-5].
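One possible way to run the KNN comparison is sketched below. The function name knn_scores is hypothetical, "[3-5]" is read as K = 3, 4, 5, and the inputs are assumed to be an already prepared 7:3 split with binary 0/1 labels.

```python
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier


def knn_scores(X_train, X_test, y_train, y_test, ks=(3, 4, 5)):
    """Map each K to (accuracy, F1) measured on the test data."""
    scores = {}
    for k in ks:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        pred = model.predict(X_test)
        scores[k] = (accuracy_score(y_test, pred), f1_score(y_test, pred))
    return scores


# Usage, assuming a prepared split:
# for k, (acc, f1) in knn_scores(X_train, X_test, y_train, y_test).items():
#     print(f"K={k}: accuracy={acc:.3f}, F1={f1:.3f}")
```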
3-2 Logistic Regression (sklearn.linear_model.LogisticRegression): Analyze how the results on the
test data change as the number of iterations (max_iter) is varied in steps of 20 over the range [0-100].
Then fix the number of iterations at 100, vary the regularization parameter (C in scikit-learn, the
inverse of the regularization strength) in steps of 1 over the range [1 to 5], and analyze how the
results on the test data change.
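The two logistic regression sweeps might look roughly like this. The function name logreg_sweep is hypothetical. The sketch starts the max_iter sweep at 20 rather than 0, because whether max_iter=0 is accepted may depend on the installed scikit-learn version; that starting point is an assumption, and 0 can be added to the range if your version allows it. Convergence warnings are silenced, since small max_iter values are expected to stop early.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score


def logreg_sweep(X_train, X_test, y_train, y_test,
                 max_iters=range(20, 101, 20), Cs=range(1, 6)):
    """Return two dicts of (accuracy, F1): one keyed by max_iter, one by C."""

    def fit_score(**params):
        model = LogisticRegression(**params)
        with warnings.catch_warnings():
            # Early stopping is the point of the sweep, so hide the warning.
            warnings.simplefilter("ignore", ConvergenceWarning)
            model.fit(X_train, y_train)
        pred = model.predict(X_test)
        return accuracy_score(y_test, pred), f1_score(y_test, pred)

    # Sweep 1: vary the iteration limit, leave C at its default.
    by_iter = {m: fit_score(max_iter=m) for m in max_iters}
    # Sweep 2: fix max_iter at 100 and vary C (inverse regularization strength).
    by_C = {c: fit_score(max_iter=100, C=float(c)) for c in Cs}
    return by_iter, by_C
```

Because C is the inverse of the regularization strength, larger values of C mean weaker regularization, which is worth keeping in mind when interpreting the second sweep.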
3-3 Decision Tree (sklearn.tree.DecisionTreeClassifier): Analyze the split criteria at the first and
second depths of the decision tree in terms of information gain. Also, with max_depth=None, use an
appropriate tool to visualize the tree and inspect the split condition and gain values at each depth.
Analyze how the results on the test data change when max_depth is set to [1~3, None].
- Evaluation Methods: Report the performance of each model in terms of Accuracy and F1-Score.
- Submission Form: There are 3 files to be submitted. Submit the csv file, the report, and the python
file in a zip file. The file name must follow the format [student number]_[name].zip (e.g.,
2020714950_Hong_Gil-dong.zip). When the python file is executed with the csv file in the same
directory, it should clearly show the results produced by each machine learning model. This is to
check whether the performance in the report matches the performance in actual execution. If you
wrote your code as an ipynb file, you can submit it instead of a python file.
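For the decision tree task in 3-3, a minimal sketch is given below. It assumes that "information gain" corresponds to criterion="entropy" in scikit-learn, which is one reading of the assignment rather than something it states. The helper names split_gains and tree_depth_scores are hypothetical. The gain of each split is computed from the fitted tree's node entropies and sample counts as the parent entropy minus the weighted entropy of its two children.

```python
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier


def split_gains(model, feature_names):
    """List (depth, feature, threshold, information gain) for every split."""
    t = model.tree_
    n = t.weighted_n_node_samples
    rows, stack = [], [(0, 0)]  # (node id, depth), starting at the root
    while stack:
        node, depth = stack.pop()
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: nothing to split
            continue
        # Parent entropy minus the sample-weighted entropy of the children.
        children = (n[left] * t.impurity[left] + n[right] * t.impurity[right]) / n[node]
        gain = t.impurity[node] - children
        rows.append((depth, feature_names[t.feature[node]],
                     float(t.threshold[node]), float(gain)))
        stack += [(right, depth + 1), (left, depth + 1)]
    return sorted(rows, key=lambda r: r[0])


def tree_depth_scores(X_train, X_test, y_train, y_test,
                      depths=(1, 2, 3, None), seed=0):
    """Map each max_depth to (accuracy, F1) measured on the test data."""
    scores = {}
    for d in depths:
        model = DecisionTreeClassifier(criterion="entropy", max_depth=d,
                                       random_state=seed).fit(X_train, y_train)
        pred = model.predict(X_test)
        scores[d] = (accuracy_score(y_test, pred), f1_score(y_test, pred))
    return scores


# Usage, assuming a prepared split held in pandas objects:
# full = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_train, y_train)
# for depth, feat, thr, gain in split_gains(full, list(X_train.columns)):
#     if depth < 2:  # first and second depths
#         print(depth, feat, thr, gain)
```

For the visualization itself, sklearn.tree.plot_tree is one tool that draws the split condition and entropy of each node; the gains listed by the sketch can then be checked against the drawn values.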