A (non-analytic) Introduction to Psychology Today Therapist Data

Posted on Wed 26 July 2017 in Data Science

In this inaugural blog post, I want to introduce the motivation and data for a data science project I've been working on.

For the last 6 years, I have been working on a PhD in Harvard's Clinical Science program (part of the Psychology department). As part of my training, I have become familiar with the research base (or lack thereof) for various psychological therapies, as well as with the more general nuances and particulars in the field of mental health treatment.

Trying to put my expertise to good use, I have often provided guidance to friends and relatives seeking qualified therapists. And on many occasions I have done the legwork of searching for and evaluating potential therapists.

Much of this time is spent on Psychology Today's (PT) Find a Therapist directory, the most extensive online public directory of mental-healthcare providers in the United States. If you were to search Google for a therapist in your city, PT's therapist directory would most likely be the first (non-Ad) search result. Unless you live in a very rural area, there are likely to be hundreds of therapists within 20 miles of you that have a profile on PT. As such, it is probably the most widely used method of locating potential therapists (aside from asking friends and family).

For a relatively small monthly fee, mental health providers (typically in private practice) can take advantage of PT's online visibility and reach by creating a profile in their directory. Profiles are highly structured. Providers can include their title (e.g., psychologist, counselor), degree(s), years of experience, fee, issues treated, treatment orientations/modalities, and a brief open-ended description of their approach, goals, and experience. When searching for a therapist, potential consumers (or 'patients', but I'll stick with 'consumers' because much of the therapist-finding process is conducted in the mindset of a consumer) can use search filters to display only those profiles that meet certain criteria.

In using PT's directory, a number of things hit me:

  1. Finding a therapist is HARD. Possibly for some of the following reasons.
  2. Within a particular category (e.g., providers who treat relationship difficulties), lots of providers sound the same. Just for kicks, here are a few quotes from different therapists on a single search results page:
    • I provide a safe, compassionate environment that is supportive and free from judgment.
    • I provide a safe, non-judgmental environment to help you gain a deeper understanding of yourself and your life experiences.
    • I provide a supportive, safe, nonjudgmental space to share your feelings and address aspects of yourself and your life that you'd like to change or enhance.
    • I believe that most of us can greatly benefit from a safe, trusting and collaborative therapeutic relationship.
  3. There are a lot of providers peddling pseudo-scientific, or downright non-scientific, treatments. (Snarky aside: 'Eclectic therapy' is just another way of saying 'I do whatever I want. I may or may not follow established research or best practices'). Can you imagine a physician saying this and getting away with it?
  4. The mental health field is highly unregulated. As long as you don't use one of a few protected titles (e.g., psychologist, licensed [anything], psychiatrist), you can call yourself pretty much anything else, and can provide almost any psycho-social service/treatment you want. And those consumers who are not trained in the mental health field are none the wiser.

On the heels of these insights, I thought that a more thorough investigation of PT profiles would be a fascinating way to learn about mental health treatment providers in the wild (kind of). The data may be useful in addressing questions about provider qualifications, the connections between treatment orientations and different psychological issues, and the best way to use/evaluate providers' profiles. And because it is a national database, profiles can be examined for variations across regions, levels of urbanicity, etc.

For the remainder of this post, I will describe my data collection and cleaning process, and give an overview of the different dataframes I created to simplify different kinds of analyses. In future posts I will use the data to address some selected questions.


Data

All the code for data scraping and cleaning can be found in my GitHub repository. The scripts are numbered in the order they were executed. All code is written in Python 2.7.

Sampling

From each of the 50 US states, 200 profiles were semi-randomly selected using the following process. First, a random zip code was selected from a given state (e.g., Kansas). PT was then queried using this zip code. Because PT appears to shuffle its search results randomly across identical queries (i.e., if you search 02139 twice, the order of therapists will be different), the first profile in the results was selected. This process was repeated at least 200 times for each state (as many times as it took to select 200 different profiles).
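The sampling loop can be sketched roughly as follows. This is an illustration, not the scraper from the repository: `sample_profiles`, `search_fn`, and `fake_search` are hypothetical names, and the real directory query is replaced by a stand-in that reshuffles a fixed set of made-up profile IDs on every call. The zip codes are only example values.

```python
import random


def sample_profiles(zip_codes, search_fn, n_target=200, max_queries=5000, seed=None):
    """Collect n_target distinct profile IDs for one state.

    zip_codes: list of zip codes belonging to the state.
    search_fn: hypothetical callable that takes a zip code and returns the
               profile IDs in the order the directory displays them.
    """
    rng = random.Random(seed)
    selected = set()
    queries = 0
    while len(selected) < n_target and queries < max_queries:
        queries += 1
        results = search_fn(rng.choice(zip_codes))
        if results:
            # Keep only the top result. A repeat leaves the set unchanged,
            # which is why more than n_target queries are usually needed.
            selected.add(results[0])
    return selected, queries


# Stand-in for the real directory query (an assumption): 50 fake profiles
# per zip code, reshuffled on every call to mimic the shuffled results.
_shuffler = random.Random(0)


def fake_search(zip_code):
    ids = ['{}-{}'.format(zip_code, i) for i in range(50)]
    _shuffler.shuffle(ids)
    return ids


kansas_zips = ['67501', '66044', '67202', '66502', '67401', '66801']
profiles, n_queries = sample_profiles(kansas_zips, fake_search, n_target=200, seed=1)
```

Using a set makes the "200 different profiles" requirement explicit: duplicates cost a query but never inflate the sample. The `max_queries` cap is there only so the loop terminates if a state has fewer reachable profiles than the target.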

therapist_profiles.csv

From each selected profile, I scraped the following information (if provided):

  • provider name
  • title (e.g., Licensed Clinical Social Worker)
  • degree(s)
  • years of experience (if a range was provided, I rounded down)
  • school
  • year graduated
  • license (# and issuing state)
  • fee (if a range was provided, I took the average)
  • accepts insurance (yes/no only)
  • city, state, zip
  • list of specialties
  • list of issues treated
  • list of "mental health" problems treated (this turns out to be useless)
  • treatment orientation
  • treatment modality
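The two range rules in the list above (average for fees, round down for years of experience) might look something like this. The raw string formats shown are guesses, the function names are hypothetical, and "rounded down" is interpreted here as taking the lower end of the range, which may not match the cleaning scripts exactly.

```python
import math
import re


def parse_numbers(text):
    """Return every number in a string such as '$100 - $150' as a float."""
    cleaned = text.replace(',', '')  # assumes commas are thousands separators
    return [float(n) for n in re.findall(r'\d+(?:\.\d+)?', cleaned)]


def parse_fee(text):
    """Average the endpoints when a fee range is given; None if no number."""
    nums = parse_numbers(text)
    if not nums:
        return None
    return sum(nums) / len(nums)


def parse_years(text):
    """Take the lower bound of a years-of-experience range, floored."""
    nums = parse_numbers(text)
    if not nums:
        return None
    return float(math.floor(min(nums)))


fee = parse_fee('$100 - $150')      # range: average of the endpoints
years = parse_years('10-15 Years')  # range: lower bound
```

Returning `None` for fields with no number keeps missing values distinguishable from a fee or experience of zero.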

For convenience, I also included count variables for the number of [degrees/specialties/issues/treatment orientations] that each provider listed. These have a suffix of 'num'.
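A count variable of this kind can be derived from the comma-separated string in the corresponding column. The sketch below uses a hypothetical helper, `count_items`, and treats a missing value as a count of 0; that is an assumption, and the actual cleaning scripts may leave some counts missing instead.

```python
def count_items(cell):
    """Count comma-separated entries in a profile field.

    Missing values (None or NaN) are counted as 0 here, which is an
    assumption rather than a documented rule of the dataset.
    """
    if cell is None or cell != cell:  # NaN is the only value not equal to itself
        return 0
    return len([item for item in str(cell).split(',') if item.strip()])


row = {
    'treatmentorientation': 'Family / Marital',
    'treatmentapproach': float('nan'),
}
row['treatmentorientationnum'] = count_items(row['treatmentorientation'])
row['treatmentapproachnum'] = count_items(row['treatmentapproach'])
```

Splitting on commas only works because individual responses such as 'Family / Marital' do not themselves contain commas; a response that did would be over-counted.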

In [2]:
import pandas as pd
data = pd.read_csv('../data/therapist_profiles.csv', index_col='id_num')
data.head()
Out[2]:
name title degrees city state ZIP profile years school statelicense ... issuesnum mentalhealth mentalhealthnum sexuality treatmentapproach treatmentapproachnum treatmentorientation treatmentorientationnum url region
id_num
108128 David Joseph Alpert counselor lmhc, ladc, cadc, ncc, pgs West Newton Massachusetts 02465 My ideal client values authenticity. By authen... 20.0 Rhode Island College 896 Massachusetts ... 43.0 Dissociative Disorders, Impulse Control Disord... 5.0 Bisexual, Gay, Lesbian NaN 0.0 Art Therapy, Attachment-based, Client-Centered... 25.0 https://therapists.psychologytoday.com/rms/pro... northeast
273873 Tim Turco psychologist phd Columbus Georgia 31901 A struggling marriage is not failed marriage. ... NaN NaN PSY002346 Georgia ... NaN NaN NaN NaN NaN 0.0 Family / Marital 1.0 https://therapists.psychologytoday.com/rms/pro... south
262595 Tanya DiGiovanni-Goldbach licensed professional counselor ma, lpc Atlanta Georgia 30309 If you are reading this profile you are taking... NaN NaN LPC005739 Georgia ... 5.0 NaN NaN Bisexual, Gay, Lesbian NaN 0.0 Client focused 1.0 https://therapists.psychologytoday.com/rms/pro... south
245763 Kristen Stitt marriage & family therapist lmft Hutchinson Kansas 67501 I am a Licensed Marriage and Family Therapist ... 2.0 Friends University 2628 Kansas ... 23.0 Dissociative Disorders, Impulse Control Disord... 3.0 NaN NaN 0.0 Attachment-based, Cognitive Behavioral (CBT), ... 11.0 https://therapists.psychologytoday.com/rms/pro... midwest
328035 Angela Konitzer licensed professional counselor ma, lpc New London Wisconsin 54961 Angie is a Licensed Professional counselor and... 6.0 NaN 5137 ... 22.0 NaN NaN NaN NaN 0.0 Attachment-based, Christian Counseling, Cognit... 12.0 https://therapists.psychologytoday.com/rms/pro... midwest

5 rows × 25 columns

Convenience data structures

A few of the variables consist of lists: For instance, 'degrees' consists of lists of providers' degrees, and 'specialties' consists of lists of providers' specialties. ('treatmentorientation' and 'issues' are also lists).

To make certain analyses of these list-data more convenient I constructed a number of other data structures.

profiledict.json

'profiledict.json' is a dictionary that provides each of the 4 list-variables in list form (as opposed to string form, as they are in the dataframe above). It also provides counts for each unique response (e.g., the number of providers who list 'PhD' in their degrees).
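To make the structure concrete, here is one way a single list variable could be turned into the 'provider_lists' and 'counts' entries. The helper name `build_list_variable` and the rule of counting each provider at most once per response are assumptions; the three providers are taken from the example output in this post.

```python
from collections import Counter


def build_list_variable(raw_by_provider):
    """Split each provider's comma-separated string and tally responses."""
    provider_lists = {}
    counts = Counter()
    for provider_id, raw in raw_by_provider.items():
        items = [s.strip().lower() for s in raw.split(',') if s.strip()]
        provider_lists[provider_id] = items
        # A set ensures a provider who repeats a response is counted once.
        counts.update(set(items))
    return {'provider_lists': provider_lists, 'counts': dict(counts)}


raw_degrees = {
    '35548': 'msw, licsw',
    '110557': 'lpc',
    '130468': 'lpc',
}
degrees = build_list_variable(raw_degrees)
```

Provider IDs are kept as strings because JSON object keys are always strings, so they would come back that way after `json.load` regardless.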

In [45]:
import json
with open("../data/profiledict.json", "r") as fd:
    profile_dict = json.load(fd)

print '\nFrom profiledict.json\n'

print '"provider_lists" for 5 providers:'
for k in profile_dict['degrees']['provider_lists'].keys()[0:5]:
    print '{}: {}'.format(k, profile_dict['degrees']['provider_lists'][k])

print '\n"counts" of 5 degrees:'
for k in profile_dict['degrees']['counts'].keys()[0:5]:
    print '{}: {}'.format(k, profile_dict['degrees']['counts'][k])
From profiledict.json

"provider_lists" for 5 providers:
110557: [u'lpc']
172798: [u'psyd', u'med', u'cams-ii', u'csac']
35548: [u'msw', u'licsw']
130468: [u'lpc']
185100: [u'ms', u'jd', u'lmft', u'cadc-i']

"counts" of 5 degrees:
scl: 1
messc: 1
coach: 6
lacd: 1
edd: 67

profilefeatures_bool_dict.json

This is a dictionary containing 4 sub-dictionaries (one for each list variable), each of which should be immediately converted to a DataFrame. Each dataframe consists of dummy codes for the endorsed degree/specialty/issue/treatment orientation. Note: only response options endorsed by >20 providers were kept as dummy codes. So, for instance, the degree 'lacd' (from above), which was endorsed by only 1 provider, is not a dummy-coded variable.
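The dummy coding with a frequency cutoff can be sketched like this. The function name `dummy_code` is hypothetical, and the nested layout (response, then provider ID, then 0/1) is an assumption: it is simply one layout that `pd.DataFrame` reads as one column per response and one row per provider. The toy data uses a cutoff of 1 instead of 20 so that the filtering is visible.

```python
def dummy_code(provider_lists, counts, min_count=20):
    """Build 0/1 indicators for responses endorsed by more than min_count providers."""
    kept = sorted(r for r, n in counts.items() if n > min_count)
    return {
        response: {pid: int(response in items)
                   for pid, items in provider_lists.items()}
        for response in kept
    }


provider_lists = {
    '110557': ['lpc'],
    '130468': ['lpc'],
    '35548': ['msw', 'licsw'],
}
counts = {'lpc': 2, 'msw': 1, 'licsw': 1}

# With a cutoff of 1, only 'lpc' (endorsed by 2 providers) survives.
degree_dummies = dummy_code(provider_lists, counts, min_count=1)
```

Note the strict inequality: a response endorsed by exactly `min_count` providers is dropped, matching the ">20" wording above.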

In [52]:
with open("../data/profilefeatures_bool_dict.json", "r") as fd:
    features_dict = json.load(fd)
print 'Issues:'
pd.DataFrame(features_dict['issues']).head()
Issues:
Out[52]:
addiction adhd adoption alcohol abuse alzheimer anger management antisocial personality anxiety asperger attachment ... substance abuse suicidal ideation teen violence testing and evaluation transgender trauma and ptsd traumatic brain injury video game addiction weight loss women
100028 1 1 1 1 0 1 0 1 0 0 ... 1 1 0 0 0 0 0 0 1 0
100057 1 1 0 0 0 1 0 1 0 0 ... 0 0 0 1 1 1 0 0 0 0
100063 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 1 0 0 0 0
100077 0 0 0 0 0 1 0 1 0 0 ... 1 0 0 0 0 0 0 0 0 0
100093 1 0 0 0 0 1 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 80 columns

In [53]:
print 'Degrees:'
pd.DataFrame(features_dict['degrees']).head()
Degrees:
Out[53]:
abpp acs acsw alc atr atr-bc ba bc bcba bcc ... plmhp plpc pmhnp psyd rn rpt rpt-s ryt sap sep
100028 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
100057 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
100063 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
100077 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
100093 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 118 columns


Conclusion

Stay tuned for analytic posts.