Last meetup, the Sourcelabs crew decided to pick up a technical topic that was quite far out of its comfort zone: the wonderful world of Machine Learning. It is a topic that has its challenges, both conceptually and technically. Most of us had little experience with Machine Learning before, but we had to start somewhere. First, we got warmed up with two videos in which the concepts were briefly explained.
Then we wanted to get our hands dirty. A good place to start turned out to be Kaggle.com. This is a platform on which contestants can score points by solving Data Science challenges, most of which involve a Machine Learning aspect. We picked up a popular beginner’s challenge: Titanic – Machine Learning from Disaster. In this challenge, the contestant is given a dataset of the passengers of the Titanic, along with some personal details, including age, ticket class, name and title, whether this person had family on board, and most importantly: whether this person survived the tragedy. A second dataset contains the same data (for different passengers), except for whether the person survived. The goal is to predict, for each passenger in the second set and based on their personal data, whether this person survived. What’s interesting about this challenge is that there are lots of different ways to come up with a solution.
We divided into groups of two to work on different solutions. Kaggle provides an interactive (Jupyter) Notebook that allows the programmer to write Python code in different cells and run them separately, instead of having to run the script in its entirety every time. It also makes it easy to add markdown to a script and to plot graphs that you can use to analyze the data. Before we could start training a Machine Learning model, we had to analyze the data, do some data cleaning, and potentially enrich the data.
Certain data columns potentially contained valuable information that had to be extracted first (called feature engineering). The name field would contain something like: ‘Palsson, Master. Gosta Leonard’. Consistently, each passenger’s name included a title like Master, Mrs, Captain, etc. It is practically infeasible for a machine learning model to discover by itself that this title, hidden in the name string, correlates with the chances of survival. Extracting the title with a simple regex and moving it to a separate column that was fed into the model turned out to significantly increase the accuracy of the predictions.
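To give an idea of what such a regex could look like, here is a minimal sketch. It is not the exact code we wrote during the meetup; the pattern and the `extract_title` helper are assumptions for illustration, based on the observation that the title sits between the comma and the first period of the name.

```python
import re

# Assumed pattern: the title is the text between the comma and the first
# period, as in 'Palsson, Master. Gosta Leonard'.
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")


def extract_title(name):
    """Return the title embedded in a passenger name, or None if none is found."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None


names = [
    "Palsson, Master. Gosta Leonard",  # example from the text above
    "Doe, Mrs. Jane",                  # made-up name for illustration
    "No title here",                   # no comma/period structure
]
titles = [extract_title(n) for n in names]
print(titles)  # ['Master', 'Mrs', None]
```

The resulting list can then be stored as its own column and fed to the model like any other categorical feature.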
This wouldn’t be a proper Data Science assignment if there weren’t some incomplete data. For a portion of the passengers, the age was unknown. One could choose to fully ignore this column, but a smarter approach would be to try to estimate the age of the passengers whose age is unknown. A separate machine learning model could be trained for this specific task. A simpler approach would be to group the passengers with known ages by certain characteristics, and calculate the average age per group. One group could for example be: ‘Women with the title Mrs owning a 3rd class ticket’. Passengers with an unknown age in this same group would then be given the average age of the group.
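The group-average approach can be sketched as follows. This is only an illustration under assumptions: the passengers are made up, the field names (`sex`, `title`, `pclass`, `age`) and the `impute_ages` helper are hypothetical, and the code uses the standard library only so it runs anywhere. In a notebook, a pandas `groupby` would be the more usual tool for this.

```python
from collections import defaultdict
from statistics import mean

# Made-up passengers for illustration; None marks an unknown age.
passengers = [
    {"sex": "female", "title": "Mrs", "pclass": 3, "age": 30.0},
    {"sex": "female", "title": "Mrs", "pclass": 3, "age": 40.0},
    {"sex": "female", "title": "Mrs", "pclass": 3, "age": None},
    {"sex": "male", "title": "Mr", "pclass": 1, "age": 50.0},
    {"sex": "male", "title": "Mr", "pclass": 1, "age": None},
]


def group_key(passenger):
    """Characteristics that define a group, e.g. ('female', 'Mrs', 3)."""
    return (passenger["sex"], passenger["title"], passenger["pclass"])


def impute_ages(rows):
    """Return copies of the rows with unknown ages set to their group average."""
    known_ages = defaultdict(list)
    for row in rows:
        if row["age"] is not None:
            known_ages[group_key(row)].append(row["age"])
    averages = {key: mean(ages) for key, ages in known_ages.items()}

    # The age stays None when a group has no passenger with a known age.
    return [
        {**row, "age": row["age"] if row["age"] is not None
         else averages.get(group_key(row))}
        for row in rows
    ]


filled = impute_ages(passengers)
print(filled[2]["age"])  # 35.0, the average of 30.0 and 40.0
```

Choosing which characteristics define a group is a judgment call: narrower groups give more specific estimates, but they are also more likely to contain no known ages at all.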
When the dataset was ready for training, a model had to be chosen. The scikit-learn Python library offers a really easy way to use a model in a plug-and-play way, and train it with the given data. This also makes it easy to swap models and compare their performance. We compared several models, including Logistic Regression and a Random Forest. Sometimes a model would give a high accuracy on the training data, but a low accuracy when evaluated on the actual test set, indicating overfitting. In the end, most of us managed to train a model that would give a prediction accuracy of 75% or higher, showing that a Machine Learning model can really learn from data, without anyone having to program predefined rules.
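The sketch below shows how such a plug-and-play comparison could look with scikit-learn. It makes a few assumptions: the data is synthetic (generated with `make_classification`), not the Titanic dataset, and the hyperparameters are arbitrary. Because Kaggle's test set comes without the survival column, part of the training data is held out as a validation set, so that training and validation accuracy can be compared locally.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, NOT the Titanic dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out 25% of the labelled data to estimate accuracy on unseen passengers.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Every scikit-learn classifier exposes fit() and score(), so swapping models
# only means changing the entries in this dictionary.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train": model.score(X_train, y_train),     # accuracy on seen data
        "validation": model.score(X_val, y_val),    # accuracy on held-out data
    }
    print(name, scores[name])
```

A training accuracy that is much higher than the validation accuracy is the overfitting signal described above.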
There is still so much to learn about Machine Learning, both in terms of state-of-the-art tools and the concepts behind them, but this Titanic challenge turned out to be a really fun and educational way to get warmed up with the topic.
In the evening, we had a great dinner at restaurant De Saffraan in Amersfoort, where we enjoyed a 7-course dinner with all kinds of kitchen miracles. Coincidentally, this restaurant was located on a boat, but luckily for us this one didn’t sink.