Predicting Titanic Survival with Kaggle Data
Posted on Fri 11 August 2017 in Data Science • Tagged with kaggle, random forests
A month ago, I finally joined Kaggle to get some practice applying machine learning algorithms. My first submission was to their Titanic competition. For the uninitiated, Kaggle is a website that hosts data science competitions, which are open to anyone, anywhere with an internet connection (with some exceptions).
Generally, competitors are trying to write algorithms that best predict some kind of outcome. In the Titanic competition, for example, Kaggle provides data on 891 actual passengers aboard the Titanic. This includes information like name, social class, other family on board, and most importantly, whether or not each passenger survived. The goal is to use this 'training data' to build some kind of model or algorithm that correctly predicts each passenger's survival outcome. But the real test is whether your algorithm accurately predicts the survival outcomes for a set of passengers for whom you do not have survival information. This is called the 'test data', and your final accuracy score is calculated based on the number of correct predictions you make for this unlabeled data.
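The train-then-predict workflow described above can be sketched in a few lines of Python. This is a toy illustration with made-up data, not the actual competition files: it fits a simple rule-based baseline (majority survival outcome per sex, a common first Titanic baseline) on labeled 'training data', then emits predictions for unlabeled 'test data' that Kaggle would score against held-out outcomes.

```python
# Hypothetical labeled training data: (sex, survived) pairs.
train = [
    ("female", 1), ("female", 1), ("female", 0),
    ("male", 0), ("male", 0), ("male", 1),
]

def fit_gender_baseline(rows):
    """Learn, per sex, the majority survival outcome in the training data."""
    counts = {}
    for sex, survived in rows:
        alive, total = counts.get(sex, (0, 0))
        counts[sex] = (alive + survived, total + 1)
    # Predict 1 for a group if more than half of its members survived.
    return {sex: int(alive * 2 > total) for sex, (alive, total) in counts.items()}

def predict(model, passengers):
    """One 0/1 survival prediction per unlabeled passenger."""
    return [model.get(sex, 0) for sex in passengers]

model = fit_gender_baseline(train)

# Unlabeled 'test data': in the real competition, Kaggle scores these
# predictions against survival outcomes we never see.
test_passengers = ["female", "male", "female"]
print(predict(model, test_passengers))  # → [1, 0, 1]
```

In the actual submission you would swap this baseline for a trained model (the post's tags suggest a random forest) and write the predictions out as a CSV, but the shape of the workflow is the same.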

Predicting Churn with Kaggle Data
Posted on Wed 26 July 2017 in Data Science • Tagged with kaggle, logistic regression
A few weeks ago I finally signed up for Kaggle and got my feet wet with a little machine learning. In this project, I analyzed (simulated) Human Resources data on 14,999 employees to predict (and understand) which employees would give their two weeks' notice. Employee retention (or conversely, 'churn') is a key problem faced by companies, as it is significantly more expensive to find, hire, and train new employees than it is to retain current ones. Thus many (all?) employers have a clear interest in understanding why people tend to leave and in identifying those currently at the highest risk of leaving.
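The post's tags name logistic regression, which models the probability of churn as a function of employee features. Here is a minimal from-scratch sketch trained by gradient descent on a handful of made-up numbers (the feature names are hypothetical stand-ins for the kind of columns in the simulated HR dataset, not the real data):

```python
import math

def sigmoid(z):
    """Squash a linear score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical features: (satisfaction_level, monthly_hours / 100).
# Label: 1 if the employee left, 0 if they stayed.
X = [(0.1, 2.8), (0.2, 2.6), (0.9, 1.8), (0.8, 2.0), (0.15, 2.7), (0.85, 1.9)]
y = [1, 1, 0, 0, 1, 0]

def train(X, y, lr=0.5, epochs=2000):
    """Fit logistic-regression weights by stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the linear score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

w, b = train(X, y)

def churn_probability(x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# On this toy data, low satisfaction should map to high churn risk.
print(churn_probability((0.1, 2.8)), churn_probability((0.9, 1.8)))
```

Besides predicting, the fitted weights are what make logistic regression useful for *understanding* churn: the sign and size of each coefficient indicate how a feature pushes the odds of leaving up or down, which matches the post's stated goal of explaining why people leave, not just flagging who will.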