Building the User Review Model with fastText (Text Classification)

My favorite tool for building text classification models is Facebook’s fastText. It’s open source, and you can run it as a command line tool or call it from Python. There are great alternatives like Vowpal Wabbit that also work well and are more flexible, but I find fastText easier to use.

You can install fastText by following these instructions.

Step 1: Download Training Data

To build a user review model, we need training data. Luckily, Yelp provides a research dataset of 4.7 million user reviews. You can download it here (but keep in mind that you can’t use this data to build commercial applications).

When you download the data, you’ll get a 4 gigabyte JSON file called review.json. Each line in the file is a JSON object with data like this:
{
  "review_id": "abc123",
  "user_id": "xyy123",
  "business_id": "1234",
  "stars": 5,
  "date": "2015-01-01",
  "text": "This restaurant is great!",
  "useful": 0,
  "funny": 0,
  "cool": 0
}

Step 2: Format and Pre-process Training Data

The first step is to convert this file into the format that fastText expects.

fastText requires a text file with each piece of text on a line by itself. The beginning of each line needs to have a special prefix of __label__YOURLABEL that assigns the label to that piece of text.

In other words, our restaurant review data needs to be reformatted like this:
__label__5 This restaurant is great!
__label__1 This restaurant is terrible :'(



# Note: This example code is written for Python 3.6+!
import json
from pathlib import Path

reviews_data = Path("dataset") / "review.json"
fasttext_data = Path("fasttext_dataset.txt")

with reviews_data.open() as input, fasttext_data.open("w") as output:
    for line in input:
        review_data = json.loads(line)
        rating = review_data['stars']
        text = review_data['text'].replace("\n", " ")
        fasttext_line = "__label__{} {}".format(rating, text)
        output.write(fasttext_line + "\n")



Running this creates a new file called fasttext_dataset.txt that we can feed into fastText for training. We aren’t done yet, though. We still need to do some additional pre-processing.

fastText is totally oblivious to any English language conventions (or the conventions of any other language). As far as it knows, the words Hello, hello and hello! are all totally different words because they aren’t exactly the same characters. To fix this, we want to do a quick pass through our text to convert everything to lowercase and to put spaces around punctuation marks. This is called text normalization and it makes it a lot easier for fastText to pick up on statistical patterns in the data.

This means that the text This restaurant is great! should become this restaurant is great !.

Here’s a simple Python function that we can add to our code to do that:

import re

def strip_formatting(string):
    # Lowercase everything, then pad punctuation with spaces
    string = string.lower()
    string = re.sub(r"([.!?,'/()])", r" \1 ", string)
    return string
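
To see what this normalization does in practice, here is a small, self-contained sketch that runs the same function on a couple of sample strings. The tokens helper is my own addition for illustration; it just splits the result on whitespace so the output is easy to read.

```python
import re

def strip_formatting(string):
    # Same normalization as above: lowercase, then pad punctuation with spaces
    string = string.lower()
    string = re.sub(r"([.!?,'/()])", r" \1 ", string)
    return string

def tokens(string):
    # Hypothetical helper: show the whitespace-separated words fastText would see
    return strip_formatting(string).split()

print(tokens("This restaurant is great!"))
# ['this', 'restaurant', 'is', 'great', '!']
print(tokens("Hello, hello and hello!"))
# ['hello', ',', 'hello', 'and', 'hello', '!']
```

After normalization, all three spellings of "hello" collapse into the same word, and the punctuation marks become separate words of their own.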

Step 3: Split the data into a Training set and a Test set

To get an accurate measure of how well our model performs, we need to test its ability to classify text using text that it didn’t see during training. If we test it against the training data, it is like giving it an open-book test where it has already memorized the answers.

So we need to extract some of the strings from the training data set and keep them in a separate test data file. Then we can test the trained model’s performance with that held-back data to get a real-world measure of how well the model performs.

Here’s a final version of our data parsing code that reads the Yelp dataset, removes any string formatting and writes out separate training and test files. It randomly splits out 90% of the data as training data and 10% as test data:


import json
from pathlib import Path
import re
import random

reviews_data = Path("dataset") / "review.json"
training_data = Path("fasttext_dataset_training.txt")
test_data = Path("fasttext_dataset_test.txt")

# What percent of data to save separately as test data
percent_test_data = 0.10

def strip_formatting(string):
    string = string.lower()
    string = re.sub(r"([.!?,'/()])", r" \1 ", string)
    return string

with reviews_data.open() as input, \
     training_data.open("w") as train_output, \
     test_data.open("w") as test_output:
    for line in input:
        review_data = json.loads(line)
        rating = review_data['stars']
        text = review_data['text'].replace("\n", " ")
        text = strip_formatting(text)
        fasttext_line = "__label__{} {}".format(rating, text)
        # Send roughly 10% of the lines to the test file, the rest to training
        if random.random() <= percent_test_data:
            test_output.write(fasttext_line + "\n")
        else:
            train_output.write(fasttext_line + "\n")

Run that and you’ll have two files, fasttext_dataset_training.txt and fasttext_dataset_test.txt. Now we are ready to train!

Here’s one more tip though: To make your model robust, you will also want to randomize the order of lines in each data file so that the order of the training data doesn’t influence the training process. That’s not absolutely required in this case since the data from Yelp is already pretty random, but it’s definitely worth doing when using your own data.
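
If you do want to shuffle a data file, a minimal sketch might look like the following. The shuffle_lines function and its seed parameter are names I made up for this example. It reads the whole file into memory, so it assumes the file fits in RAM and that every line (including the last) ends with a newline, which is true for the files written by the code above.

```python
import random
from pathlib import Path

def shuffle_lines(path, seed=None):
    """Shuffle the lines of a text file in place.

    Assumption: the whole file fits in memory and each line ends with "\n".
    """
    path = Path(path)
    with path.open(encoding="utf-8") as f:
        lines = f.readlines()
    # A dedicated Random instance makes the shuffle repeatable when seed is set
    random.Random(seed).shuffle(lines)
    with path.open("w", encoding="utf-8") as f:
        f.writelines(lines)
```

You would call it as shuffle_lines("fasttext_dataset_training.txt"). For files too large to load at once, a command line tool such as GNU shuf is probably a better fit.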

Step 4: Train the Model

You can train a classifier using the fastText command line tool. You just call fasttext, pass in the supervised keyword to tell it to train a supervised classification model, and then give it the training file and an output name for the model:
fasttext supervised -input fasttext_dataset_training.txt -output reviews_model 

It only took 3 minutes to train this model with 580 million words on my laptop. Not bad!

Step 5: Test the Model

Let’s see how accurate the model is by checking it against our test data:
fasttext test reviews_model.bin fasttext_dataset_test.txt
N 474292
P@1 0.678
R@1 0.678


This means that across 474,292 examples, it guessed the user’s exact star rating 67.8% of the time. Not a bad start.

You can also ask fastText to check how often the correct star rating was in one of its top two predictions (i.e. if the model’s top two most likely guesses were “5” and “4” and the real user said “4”):
fasttext test reviews_model.bin fasttext_dataset_test.txt 2
N 474292
P@2 0.456
R@2 0.912


That means that 91.2% of the time, it recalled the user’s star rating if we check its two best guesses. That’s a good indication that the model is not far off in most cases.
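
The two numbers are linked by simple arithmetic. Here is a rough sketch of how precision and recall at k can be computed, assuming every review has exactly one true label and the model always returns exactly k guesses. The function name and the toy data are invented for this example; this is not fastText's own code.

```python
def precision_recall_at_k(true_labels, top_k_predictions, k):
    """Compute P@k and R@k when each example has exactly one true label."""
    # A "hit" means the true label appears among the top k guesses
    hits = sum(
        1
        for truth, guesses in zip(true_labels, top_k_predictions)
        if truth in guesses[:k]
    )
    n = len(true_labels)
    precision = hits / (n * k)  # correct guesses out of all guesses made
    recall = hits / n           # true labels that were found
    return precision, recall

# Toy data: four reviews, two guesses each, three of them contain the truth
true_labels = ["5", "4", "1", "3"]
guesses = [["5", "4"], ["5", "4"], ["2", "1"], ["4", "5"]]
print(precision_recall_at_k(true_labels, guesses, 2))
# (0.375, 0.75)
```

Under those assumptions, precision at 2 is always half of recall at 2, because the model makes two guesses per review but at most one can be right. That is why the output above shows P@2 of 0.456, which is exactly 0.912 / 2.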

You can also try out the model interactively by running the fasttext predict command and then typing in your own reviews. When you hit enter, it will tell you its prediction for each one:
fasttext predict reviews_model.bin -
this is a terrible restaurant . i hate it so much .
__label__1
this is a very good restaurant .
__label__4
this is the best restaurant i have ever tried .
__label__5


Important: You have to type in your reviews in all lower case and with spaced-out punctuation, just like the training data! If you don’t format your examples the same way as the training data, the model will do very poorly.

Step 6: Iterate on the model to make it more accurate

With the default training settings, fastText tracks each word independently and doesn’t care at all about word order. But when you have a large training data set, you can ask it to take the order of words into consideration by using the wordNgrams parameter. That will make it track groups of words instead of just individual words.

For a data set of millions of words, tracking two-word pairs (also called bigrams) in addition to single words is a good starting point for improving the model.
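
To make the idea of a bigram concrete, here is a tiny sketch that lists the two-word pairs in a sentence. The word_ngrams function is a hypothetical helper written for this illustration. It only shows what a bigram is and says nothing about how fastText stores them internally.

```python
def word_ngrams(text, n=2):
    """Return every run of n consecutive words in the text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams("the food was not good ."))
# ['the food', 'food was', 'was not', 'not good', 'good .']
```

Notice the pair "not good". Looking at single words, the model only sees "good"; with bigrams it has a chance to learn that "not good" points toward a lower rating.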

Let’s train a new model with the -wordNgrams 2 parameter and see how it performs:
fasttext supervised -input fasttext_dataset_training.txt -output reviews_model_ngrams -wordNgrams 2


This will make training take a bit longer and it will make the model file much larger (since there is now an entry for every two-word pair in the data), but it can be worth it if it gives us higher accuracy.

Once the training completes, you can re-run the test command the same way as before:
fasttext test reviews_model_ngrams.bin fasttext_dataset_test.txt


For me, using -wordNgrams 2 raised accuracy on the test set to 71.2%, an improvement of about 3.4 percentage points. It also seems to reduce the number of obvious errors that the model makes because now it cares a little bit about the context of each word.

There are other ways to improve your model, too. One of the simplest but most effective ways is to skim your training data file by hand and make sure that the preprocessing code is formatting your text in a sane way.

For example, my sample text pre-processing code will turn the common restaurant name P.F. Chang into p . f . chang. That appears as five separate words to fastText.

If you have cases like that where important words that represent a single concept are getting split up, you can write custom code to fix it. In this case, you might add code to look for common restaurant names and replace them with placeholders like p_f_chang so that fastText sees each as a single word.
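
As a rough sketch of that idea, the code below swaps known names for placeholders before the punctuation is spaced out. The KNOWN_NAMES table, the protect_names function and the preprocess wrapper are all made up for this example; in practice you would build the list from names that actually show up in your own data.

```python
import re

# Hypothetical lookup table: lowercase name -> single-word placeholder
KNOWN_NAMES = {
    "p.f. chang": "p_f_chang",
    "t.g.i. friday": "t_g_i_friday",
}

def protect_names(text, names=KNOWN_NAMES):
    # Lowercase first so the lookup matches regardless of capitalization
    text = text.lower()
    for name, placeholder in names.items():
        text = text.replace(name, placeholder)
    return text

def strip_formatting(string):
    string = string.lower()
    string = re.sub(r"([.!?,'/()])", r" \1 ", string)
    return string

def preprocess(text):
    # Placeholders must go in before punctuation gets spaced out
    return strip_formatting(protect_names(text))

print(preprocess("We love P.F. Chang!").split())
# ['we', 'love', 'p_f_chang', '!']
```

The underscore is not in the list of punctuation that strip_formatting splits on, so the placeholder survives as one word. Remember to run the same preprocess function on any text you send to the trained model later.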

Step 7: Use your model in your program!

The best part about fastText is that it’s easy to call a trained model from any Python program.

There are a few different Python wrappers for fastText that you can use, but I like the official one created by Facebook. You can install it by following these directions.

With that installed, here’s the entire code to load the model and use it to automatically score user reviews:


import re
import fastText  # Official Python bindings (newer releases install the module as "fasttext")

def strip_formatting(string):
    string = string.lower()
    string = re.sub(r"([.!?,'/()])", r" \1 ", string)
    return string

# Reviews to check
reviews = [
    "This restaurant literally changed my life. This is the best food I've ever eaten!",
    "I hate this place so much. They were mean to me.",
    "I don't know. It was ok, I guess. Not really sure what to say."
]

# Pre-process the text of each review so it matches the training format
preprocessed_reviews = list(map(strip_formatting, reviews))

# Load the model
classifier = fastText.load_model('reviews_model_ngrams.bin')

# Get fastText to classify each review with the model
labels, probabilities = classifier.predict(preprocessed_reviews, 1)

# Print the results
for review, label, probability in zip(reviews, labels, probabilities):
    # The label looks like "__label__5", so the last character is the star count
    stars = int(label[0][-1])
    print("{} ({}% confidence)".format("☆" * stars, int(probability[0] * 100)))
    print(review)
    print()

And here’s what it looks like when it runs:
☆☆☆☆☆ (100% confidence)
This restaurant literally changed my life. This is the best food I've ever eaten!
☆ (88% confidence)
I hate this place so much. They were mean to me.
☆☆☆ (64% confidence)
I don't know. It was ok, I guess. Not really sure what to say.

Those are really good prediction results! And let’s see what prediction it would give my Yelp review:
☆☆☆☆☆ (58% confidence)
This used to be a giant parking lot where government employees that worked in the country building would park. They moved all the parking underground and built an awesome park here instead. It's literally the reverse of the Joni Mitchell song.


Perfect!

This is why machine learning is so cool. Once we figured out a good way to pose the problem, the algorithm did all the hard work of extracting meaning from the training data. You can then call that model from your own program with just a couple of lines of code. And just like that, your program seemingly gains superpowers.

Now go out and build your own text classifier!

GitHub Repo for the Code


You can also find me on LinkedIn. I’d love to hear from you if I can help you or your team with machine learning.

