Using Reddit and NLP to Diagnose Different Types of Depression

No, seriously. Machine learning can better predict mental health than the current industry standard.

Photo by Anthony Tran on Unsplash

The purpose of this project is to differentiate between unipolar and bipolar depression, so it’s important to know the general difference. The key difference between the two is that unipolar depression results in a consistent depressed state while bipolar depression fluctuates, oftentimes in cycles of extreme low and high moods.

Misdiagnosis of bipolar disorder is widespread. A 1999 study found that 40% of bipolar patients had been misdiagnosed in the past, most with major depression (source). Misdiagnoses also go uncaught for long periods of time. A European survey of 1,000 people with bipolar disorder found that it took 5.7 years on average to correct a misdiagnosis (source). The consequences can be severe, as Tanvir Singh (MD) and Muhammad Rajput (MD) write:

An incorrect diagnosis of unipolar depression carries the risk of inappropriate treatment with antidepressants, which can result in manic episodes and trigger rapid cycling. Delay in start of mood stabilizers in bipolar disorder patients has been associated with increased healthcare costs, which include increased suicide attempts and higher rates of hospital use. — Psychiatry MMC

The current non-human diagnostic standard is the Mood Disorder Questionnaire. When it was evaluated in the American Journal of Psychiatry in 2000, it produced a sensitivity (recall) score of 73%, which means that 27% of people who are bipolar are still not properly diagnosed by this questionnaire.

Photo by Paola Chaaya on Unsplash

The Project

This project intends to create a model that correctly classifies depression and bipolar posts on Reddit and provides insight for real-world applications in mental health diagnostics. We will look at the overall accuracy of predictions, but primarily at sensitivity (recall), which reflects the percentage of bipolar posts that were correctly predicted as bipolar. The baseline, as mentioned above, is 73%.
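To make the recall metric concrete, here is a minimal sketch in plain Python. The function name and the example labels are invented for illustration and are not taken from the project's data.

```python
def recall(y_true, y_pred, positive="bipolar"):
    """Share of truly positive posts that the model also labels positive."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    actual_pos = true_pos + false_neg
    # Guard against division by zero when there are no positive posts
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical labels, invented purely for illustration
y_true = ["bipolar", "bipolar", "bipolar", "bipolar", "depression", "depression"]
y_pred = ["bipolar", "bipolar", "bipolar", "depression", "depression", "bipolar"]

example_recall = recall(y_true, y_pred)  # 3 of the 4 bipolar posts are caught
```

Note that the depression post wrongly predicted as bipolar does not affect recall at all; it would lower precision instead.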

Feel free to take a look through the methodology below, but you can also skip to the analysis part here.

Methodology:
Data Collection
Text Preprocessing
Sentiment Analysis
Modeling
Analysis:
Results

The data used in this project comes from the Pushshift API. I pulled unique posts with over 100 characters because I wanted to use data that is text-focused, not memes or links. The function requests data from r/depression and r/bipolar in JSON format, which I found easier to convert to a pandas DataFrame for further preprocessing.

Feel free to use the function below, but of course try to limit the number of requests you send:
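The original function is not reproduced in this text, so what follows is only a rough sketch of what such a pull could look like. It assumes the public Pushshift submission search endpoint and its `subreddit`, `size` and `before` query parameters; all function names are hypothetical, and access to Pushshift may be restricted.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the public Pushshift submission search URL
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_url(subreddit, size=100, before=None):
    """Build the query URL for one page of submissions from a subreddit."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        params["before"] = before  # Unix timestamp, used to page backwards
    return PUSHSHIFT_URL + "?" + urllib.parse.urlencode(params)


def fetch_page(subreddit, size=100, before=None, timeout=30):
    """Request one page and return the list of raw post dicts."""
    url = build_url(subreddit, size, before)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["data"]


def keep_text_posts(posts, min_chars=100):
    """Keep unique posts whose body is longer than min_chars characters."""
    seen, kept = set(), []
    for post in posts:
        text = (post.get("selftext") or "").strip()
        if len(text) > min_chars and text not in seen:
            seen.add(text)
            kept.append({
                "subreddit": post.get("subreddit"),
                "title": post.get("title"),
                "selftext": text,
            })
    return kept
```

The list returned by `keep_text_posts` can be passed straight to `pandas.DataFrame(...)`. Pausing between calls to `fetch_page`, for example with `time.sleep`, is one simple way to keep the request rate low.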

Demonstration by Author of Function

First we have to prepare the data for the model, which means tokenizing and lemmatizing the text. Here are some simple but effective preprocessing steps:
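The original preprocessing code is not included in this text. As a minimal sketch of the kind of steps described, the function below lowercases a post, strips URLs, tokenizes, drops stop words and optionally lemmatizes. The stop-word list is deliberately tiny, and the `lemmatize` argument is a hypothetical hook rather than part of the project's code.

```python
import re

# A tiny illustrative stop-word list; a real run would use a fuller one
STOP_WORDS = {"a", "an", "and", "i", "in", "is", "it", "of", "the", "to"}

URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[a-z']+")


def preprocess(text, lemmatize=None):
    """Lowercase, drop URLs, tokenize, remove stop words, then lemmatize."""
    text = URL_RE.sub(" ", text.lower())
    tokens = [tok.strip("'") for tok in TOKEN_RE.findall(text)]
    tokens = [tok for tok in tokens if tok and tok not in STOP_WORDS]
    if lemmatize is not None:
        # Any callable that maps one token to its base form can be used here
        tokens = [lemmatize(tok) for tok in tokens]
    return tokens
```

A real lemmatizer, such as the `lemmatize` method of NLTK's `WordNetLemmatizer`, could be passed in as the `lemmatize` argument.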

You can also take a look at this in-depth tutorial here.

We can then look at our top occurring words. You’ll notice that “depression” occurs frequently, but that might not be the best word to include in our model: it closely mirrors the name of one of the subreddits, so it would create too much bias. We’ll filter out words like this later in the modeling process.

Image by Author
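The word counts behind a chart like this could be produced with a few lines of standard-library Python. This is a sketch under the assumption that each post has already been turned into a list of tokens; the function name is hypothetical.

```python
from collections import Counter


def top_words(token_lists, n=10):
    """Return the n most common tokens across all posts."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)
```

Calling `top_words` separately on the tokens from each subreddit makes it easy to spot words that appear almost exclusively in one of them.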

We can also do a sentiment analysis that might help us understand our data better and model more efficiently. This visualization of the sentiment analysis shows a lot of posts that were determined to be very negative (closest to -1), which makes sense, but it ALSO shows a lot of positive posts (closest to 1), which is more surprising.

Image by Author

Going a little deeper, we can start to get some valuable info. Comparing the number of very negative and very positive posts in each subreddit could help us understand those extremes better.

Image by Author

Interesting. There are more very positive posts in the bipolar subreddit, which makes sense when we think about it. Bipolar disorder involves mood swings that include manic, very high highs, so somebody writing about their bipolar experiences might be more likely to use positive language. We definitely don’t want to drop this data.
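The comparison of extremes described above could be sketched as follows, assuming every post already has a sentiment score between -1 and 1. The cut-offs of -0.9 and 0.9 and the function name are illustrative assumptions, not values taken from the project.

```python
from collections import defaultdict


def count_extremes(scored_posts, low=-0.9, high=0.9):
    """Count very negative and very positive posts per subreddit."""
    counts = defaultdict(
        lambda: {"very_negative": 0, "very_positive": 0, "total": 0}
    )
    for subreddit, score in scored_posts:
        counts[subreddit]["total"] += 1
        if score <= low:
            counts[subreddit]["very_negative"] += 1
        elif score >= high:
            counts[subreddit]["very_positive"] += 1
    return dict(counts)
```

If the two subreddits contribute different numbers of posts, dividing each count by `total` gives shares that are fairer to compare than raw counts.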

Creating an NLP Model

Lastly, let’s get to the modeling. I’ll use a pipeline with TF-IDF and a Random Forest Classifier for this.

The best thing to do is to take a look at all the TF-IDF and Random Forest parameters and perform an iterative GridSearch to find the best ones:
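The original search code is not included in this text. A minimal sketch of such a search with scikit-learn's `Pipeline` and `GridSearchCV` could look like the block below. The toy corpus, the small grid and the `random_state` are all invented for illustration; a real search would cover far more values and posts.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tvec", TfidfVectorizer(stop_words="english")),
    ("rf", RandomForestClassifier(random_state=42)),
])

# The "tvec__" and "rf__" prefixes route each value to the matching step
param_grid = {
    "tvec__ngram_range": [(1, 1), (1, 2)],
    "rf__n_estimators": [10, 25],
    "rf__min_samples_split": [2, 4],
}

# Toy corpus invented for illustration; 1 = bipolar, 0 = depression
texts = [
    "manic episode last week then a crash",
    "mood swings between mania and lows",
    "started a mood stabilizer after mania",
    "rapid cycling makes work difficult",
    "feeling empty and tired every day",
    "no motivation and constant sadness",
    "cannot sleep and feel hopeless",
    "everything feels grey and heavy",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Optimize for recall on the bipolar class, the metric this project cares about
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=2)
search.fit(texts, labels)
best_params = search.best_params_
```

Searching a coarse grid first and then narrowing the ranges around the best values is one way to make the search iterative.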

These were the parameters I ended up using for the final model, but it all depends on what you’re modeling. Notice the max_df and min_df parameters. Given as integers, they tell the vectorizer to keep only terms that appear in at least min_df posts and at most max_df posts. This helps to exclude words that might make our model overly biased, like “depression”.

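from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF features: unigrams and bigrams, filtered by document frequency
tvec = TfidfVectorizer(max_df=700,
                       max_features=8000,
                       min_df=20,
                       ngram_range=(1, 2),
                       stop_words='english')

# Random forest classifier trained on the TF-IDF features
rf = RandomForestClassifier(max_depth=90,
                            max_features=90,
                            min_samples_leaf=1,
                            min_samples_split=25)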
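As a rough illustration of what integer min_df and max_df values do, the plain-Python sketch below filters terms by the number of posts they appear in. It is not scikit-learn's implementation, and the function name is hypothetical.

```python
from collections import Counter


def filter_by_document_frequency(token_lists, min_df=1, max_df=None):
    """Keep terms that occur in at least min_df and at most max_df posts."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per post
    upper = len(token_lists) if max_df is None else max_df
    return {term for term, df in doc_freq.items() if min_df <= df <= upper}
```

With three posts that all contain “depression”, setting max_df=2 removes that word, while min_df=2 removes terms that show up in only a single post.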
Photo by Paul Gilmore on Unsplash

The Future of Mental Health?

NLP has a huge future in the field of mental health diagnosis. If a new test were written using natural language responses, a model like this could possibly detect things even a trained psychiatrist couldn’t. Going further, it could be used much like a blood glucose meter is used by diabetes patients: a bipolar patient could write in a journal that tracks manic cycles using machine learning algorithms. This could possibly save the lives of thousands of people.

It could also lead to low-cost alternatives to expensive psychiatric care, especially in developing countries where mental health resources are slim to none. While seeing a licensed mental health care provider is the best option, something like this is still better than self-diagnosis and could lead people to the proper care that they need to live a full life.

Most importantly, it could help lower the misdiagnosis of unipolar and bipolar depression. These misdiagnoses have serious consequences, and NLP could be the key to preventing them. Currently the industry standard questionnaire has a recall score of 73%, while using just a couple of months of Reddit posts, this project was able to achieve a recall score of over 80%. While the two are not direct parallels, it goes to show that the industry can improve by integrating NLP-based machine learning algorithms.

Feel free to use my code and contact me through LinkedIn if you have any questions about the project!
