Predicting Titanic Survival with Kaggle Data

Posted on Fri 11 August 2017 in Data Science • Tagged with kaggle, random forests

A month ago, I finally joined Kaggle to get some practice applying machine learning algorithms. My first submission was to their Titanic competition. For the uninitiated, Kaggle is a website that hosts data science competitions, which are open to anyone, anywhere with an internet connection (with some exceptions).

Generally, competitors are trying to write algorithms that best predict some kind of outcome. In the Titanic competition, for example, Kaggle provides data on 891 actual passengers aboard the Titanic. This includes information like name, social class, other family on board, and most importantly, whether or not each passenger survived. The goal is to use this 'training data' to build some kind of model or algorithm that correctly predicts each passenger's survival outcome. But the real test is whether your algorithm accurately predicts the survival outcomes for a set of passengers for whom you do not have survival information. This is called the 'test data', and your final accuracy score is calculated based on the number of correct predictions you make for this unlabeled data.
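To make the scoring idea concrete, here is a minimal sketch of how an accuracy score of this kind could be computed: the fraction of passengers whose predicted outcome matches the true one. The function name `accuracy` and the five example labels are made up for illustration; they are not Kaggle's code or real Titanic data.

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one label to score")
    # Count positions where the predicted label equals the true label.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels: 1 = survived, 0 = did not survive.
actual = [1, 0, 0, 1, 1]
predicted = [1, 0, 1, 1, 0]

score = accuracy(predicted, actual)
print(score)  # 3 of 5 predictions match, so this prints 0.6
```

In the competition itself the true labels for the test data are held back, so a calculation like this can only be run locally on passengers set aside from the training data.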


Continue reading

Who are today's mental health practitioners in private practice?

Posted on Thu 10 August 2017 in Data Science • Tagged with psychology, mental health, psychology today


In a previous post, I introduced some data I scraped from Psychology Today's Find A Therapist directory. The data (and code) can be found in my GitHub repository.

The first thing I'd like to do with this data is learn a bit more about the providers of therapy in private practice today. Who are they? How were they trained? What degrees do they hold? What issues do they treat? And what methods do they use?


Continue reading

Predicting Churn with Kaggle Data

Posted on Wed 26 July 2017 in Data Science • Tagged with kaggle, logistic regression

A few weeks ago I finally signed up for Kaggle and got my feet wet with a little machine learning. In this project, I analyzed (simulated) Human Resources data on 14,999 employees to predict (and understand) which employees would give their two weeks' notice. Employee retention (or conversely, 'churn') is a key problem faced by companies, as it is significantly more expensive to find, hire, and train new employees than it is to retain current ones. Thus many (all?) employers have a clear interest in understanding why people tend to leave and in identifying those who are currently at the highest risk of leaving.


Continue reading

A (non-analytic) Introduction to Psychology Today Therapist Data

Posted on Wed 26 July 2017 in Data Science • Tagged with psychology, mental health, psychology today

In this inaugural blog post, I want to introduce the motivation and data for a data science project I've been working on.

For the last 6 years, I have been working on a PhD in Harvard's Clinical Science program (part of the Psychology department). As part of my training, I have become familiar with the research base (or lack thereof) for various psychological therapies, as well as with the more general nuances and particulars in the field of mental health treatment.


Continue reading