Text Classification

November 17, 2018

Here, we are going to learn about text classification, the secret weapon that NLP developers use to build cutting-edge systems with relatively dumb models.

To learn what NLP is and get started, you can first go through this article.

The kind of results you can get with text classification compared to the development effort is off the charts.

Machine Learning Pipeline

The NLP pipeline that we will be using:

First we split text into sentences, then we break sentences down into nouns and verbs, then we figure out the relationships between those words, and so on. It’s a very logical approach and logic just feels right, but logic isn’t necessarily the best way to go about extracting data from text.

A lot of user-created content is messy, unstructured and, some might even say, nonsensical.

Extracting data from messy text by analyzing its grammatical structure is very challenging because the text doesn’t follow normal grammatical rules. We can often get better results using dumber models that work from the bottom up. Instead of analyzing sentence structure and grammar, we’ll just look for statistical patterns in word use.

Using Classification Models to Extract Meaning

Let’s look at user reviews, one of the most common types of online data that you might want to parse with a computer. Here is a Yelp review of a public park:

From the screenshot, you can see that the user gave the park a 5-star review. But if he had posted this review without a star rating, you would still automatically understand that he liked the park from how he described it.

How can we write a program that can read this text and understand that the user liked the park even though he never directly said “I like this park” in the text? The trick is to reframe this complex language understanding task as a simple classification problem.

Let’s set up a simple linear classifier that takes in words. The input to the classifier is the text of the review. The output is one of 5 fixed labels — “1 star”, “2 stars”, “3 stars”, “4 stars”, or “5 stars”.

If the classifier is able to take in the text and reliably predict the correct label, that means it must somehow understand the text enough to extract the overall meaning of whether or not the user liked the park. Of course, the model’s level of “understanding” is just that it churns some data through a statistical model and gets a most likely answer. It’s not similar to human intelligence. But if the end result is the same most of the time, then it doesn’t really matter.

To train our text classification model, we’ll collect a lot of user reviews of similar places (parks, businesses, landmarks, hotels, whatever we can find…) where the user wrote a text review and assigned a similar star rating. And by lots, I mean millions of reviews! Then we’ll train the model to predict a star rating based on the corresponding text.

Once the model is trained, we can use it to make predictions for new text. Just pass in a new piece of text and get back a score:
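To make the train-then-predict loop concrete, here is a minimal sketch in plain Python. It is a toy perceptron (one simple kind of linear classifier) over whitespace-separated words, not the method of any particular library. The names tokenize, train, predict and TOY_REVIEWS, and the five made-up reviews, are assumptions for illustration only; a real model would need far more data.

```python
from collections import defaultdict

LABELS = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]

# Made-up training data (an assumption for this sketch, not real reviews).
TOY_REVIEWS = [
    ("beautiful park great trails loved it", "5 stars"),
    ("lovely quiet spot great for picnics", "5 stars"),
    ("dirty park broken benches terrible smell", "1 star"),
    ("terrible parking and dirty restrooms", "1 star"),
    ("okay park nothing special", "3 stars"),
]


def tokenize(text):
    # Lowercase and split on whitespace; no grammar analysis at all.
    return text.lower().split()


def score(weights, label, tokens):
    # Linear score: the sum of this label's weights for each word.
    return sum(weights[label].get(tok, 0.0) for tok in tokens)


def predict(weights, text):
    # Pick the label with the highest score (ties go to the first label).
    tokens = tokenize(text)
    return max(LABELS, key=lambda label: score(weights, label, tokens))


def train(examples, epochs=10):
    # One weight per (label, word) pair, all starting at zero.
    weights = {label: defaultdict(float) for label in LABELS}
    for _ in range(epochs):
        for text, gold in examples:
            guess = predict(weights, text)
            if guess != gold:
                # Perceptron update: boost the correct label's words,
                # penalize the same words for the wrongly guessed label.
                for tok in tokenize(text):
                    weights[gold][tok] += 1.0
                    weights[guess][tok] -= 1.0
    return weights


weights = train(TOY_REVIEWS)
print(predict(weights, "great trails and a beautiful lake"))   # 5 stars
print(predict(weights, "dirty benches and a terrible smell"))  # 1 star
```

Neither test sentence appears in the training data. The model only knows which words tended to show up with which rating, which is exactly the statistical shortcut described above.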

With this simplistic model, we can do all kinds of useful things. For example, we could start a company that analyzes social media trends. Companies would hire us to track how their brand is perceived online and to alert them to negative trends in perception. No kidding! :)

To build that, we’d just scan for any tweets that mentioned our customer’s business. Then we’d feed all those tweets into the text classification model to predict if each user likes or dislikes the business. Once we have numerical ratings representing each user’s feelings, we could track changes of average score over time. We could even automatically trigger an action whenever someone posts something very negative about the business. Free start-up idea, just remember who gave you the idea :)
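The tracking part is mostly bookkeeping once each tweet has a predicted score. Here is a rough sketch, assuming the classifier has already produced a (date, stars) pair per tweet. The sample data, the ALERT_THRESHOLD cutoff and the function names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical classifier output: (date, predicted star rating) per tweet.
scored_tweets = [
    ("2018-11-01", 5), ("2018-11-01", 4),
    ("2018-11-02", 4), ("2018-11-02", 1),
    ("2018-11-03", 2), ("2018-11-03", 1),
]

# Assumed cutoff: a 1-star prediction counts as "very negative".
ALERT_THRESHOLD = 1


def daily_averages(scored):
    # Group predicted ratings by day, then average each day's ratings.
    by_day = defaultdict(list)
    for day, stars in scored:
        by_day[day].append(stars)
    return {day: mean(ratings) for day, ratings in sorted(by_day.items())}


def very_negative(scored, threshold=ALERT_THRESHOLD):
    # Tweets that should trigger an automatic action.
    return [(day, stars) for day, stars in scored if stars <= threshold]


averages = daily_averages(scored_tweets)
alerts = very_negative(scored_tweets)
print(averages)  # {'2018-11-01': 4.5, '2018-11-02': 2.5, '2018-11-03': 1.5}
print(alerts)    # [('2018-11-02', 1), ('2018-11-03', 1)]
```

In this made-up sample the daily average drops from 4.5 to 1.5 over three days, which is the kind of negative trend a customer would want to hear about.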

Why does this work? It seems too simple!

On its face, using text classification to understand text sounds like magical thinking. With a traditional NLP pipeline, we have to do a lot of work to understand the grammatical structure of text. With a classifier, we’re just throwing huge buckets of text into a wood chipper and hoping for the best. Isn’t human expression more nuanced and complex than that? This is the kind of over-hyping and oversimplification that makes machine learning look bad, right?

There are several reasons why treating text as a classification problem instead of as an understanding problem tends to work really well, even when using relatively simple linear classification models.

First, people constantly create and evolve language. Especially in an online world full of memes and emoji, writing code to reliably parse tweets and user reviews is going to be pretty difficult. This is the only time I hate memes! # Meme Review ¯\_(ツ)_/¯

With text classification, the algorithm doesn’t care whether the user wrote standard English, an emoji, or a reference to Goku. The algorithm is looking for statistical relationships between input phrases and outputs. If writing ಠ_ಠ correlates more heavily with 1-star and 2-star reviews, the algorithm will pick that up even though it has no idea what a “look of disapproval” emoticon is. The classifier can still figure out what characters mean in the context of where they appear and how often they contribute to a particular output.
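Here is a small sketch of what “statistical relationships” means in practice. It simply counts which star ratings each token appears with, treating an emoticon exactly like any other word. The four reviews and the average_rating helper are made up for illustration; a trained classifier learns weights rather than raw averages, but the intuition is the same.

```python
from collections import Counter, defaultdict

# Made-up reviews: (text, star rating). Emoticons are just tokens.
reviews = [
    ("ಠ_ಠ waited an hour", 1),
    ("closed early again ಠ_ಠ", 2),
    ("great view 😍", 5),
    ("😍 best park in town", 5),
]

# For every token, count how often it appears with each star rating.
counts = defaultdict(Counter)
for text, stars in reviews:
    for token in text.split():
        counts[token][stars] += 1


def average_rating(token):
    # Average star rating of the reviews containing this token.
    per_star = counts.get(token)
    if not per_star:
        return None  # token never seen in the training reviews
    total = sum(per_star.values())
    return sum(stars * n for stars, n in per_star.items()) / total


print(average_rating("ಠ_ಠ"))  # 1.5
print(average_rating("😍"))    # 5.0
```

The code never needs to know what either symbol means. It only sees that one of them keeps turning up next to low ratings and the other next to high ones.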

Second, website users don’t always write in the specific language that you expect. An NLP pipeline trained to handle American English is going to fall apart if you give it German text. It’s also going to do poorly if your user decides to write their reviews in Cockney rhyming slang, which is still technically English. ( ͡°( ͡° ͜ʖ( ͡° ͜ʖ ͡°)ʖ ͡°) ͡°)

Again, a classification algorithm doesn’t care what language the text is in as long as it can at least break apart the text into separate words and measure the effects of those words. As long as you give the classifier enough training data to cover a wide range of possible English and German user reviews, it will learn to handle both just fine.

And finally, a big reason that text classification is so great is that it is fast. Because linear text classification algorithms are so simple (compared to more complex machine learning models like recurrent neural networks), they can be trained quickly. You can train a linear classifier with gigabytes of text in minutes on a regular laptop. You don’t even need any fancy hardware like a GPU. So even if you can get a slightly better accuracy score with a different machine learning algorithm, sometimes the tradeoff isn’t worth it. And research has shown that often the accuracy gap is nearly zero anyway.

While text classification models are simple to set up, that’s not to say they are always easy to get working well. The big catch is that you need a lot of training data. If you don’t have enough training data to cover the wide range of the ways that people write things, the model won’t ever be very accurate. The more training data you can collect, the better the model will perform. The real art of applying text classification well is in finding clever ways of automatically collecting or creating training data.

What can you do with Text Classification?

We’ve seen that we can use text classification to automatically score a user’s review text. That’s a type of sentiment analysis. Sentiment analysis is where you look at text that a user wrote and you try to figure out if the user is feeling positive or negative.

There are lots of other practical uses of text classification. One that you probably use every day as a consumer without knowing it is the email spam filtering feature built into your email service. If you have a group of real emails marked as “spam” or “not spam”, you can use those to train a classification model that automatically flags spam emails in the future:

Along the lines of spam filtering, you can also use text classification to identify abusive or obscene content and flag it. A lot of websites use text classification as a first-line defense against abusive users. By also taking the model’s confidence score into consideration, you can automatically block the worst offenders while sending the less certain cases to a human moderator to evaluate.
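The confidence-based routing described above boils down to two thresholds. Here is a minimal sketch, assuming the abuse classifier returns a confidence between 0 and 1. The threshold values and the moderate function are assumptions; in practice you would tune the cutoffs on your own data.

```python
# Assumed cutoffs for illustration; tune these on real moderation data.
BLOCK_THRESHOLD = 0.95
REVIEW_THRESHOLD = 0.60


def moderate(abuse_confidence):
    # abuse_confidence: the model's confidence (0 to 1) that a post is abusive.
    if abuse_confidence >= BLOCK_THRESHOLD:
        return "block"          # near-certain abuse: block automatically
    if abuse_confidence >= REVIEW_THRESHOLD:
        return "human_review"   # uncertain: send to a human moderator
    return "allow"              # probably fine: publish as usual


for confidence in (0.99, 0.70, 0.10):
    print(confidence, moderate(confidence))
```

Lowering BLOCK_THRESHOLD blocks more abuse automatically but also blocks more innocent posts, so the two numbers are really a policy decision, not just a technical one.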

You can expand the idea of filtering beyond spam and abuse. More and more companies use text classification to route support tickets. The goal is to parse support questions from users and route them to the right team based on the kind of issue that the user is most likely reporting:

By using classification to automate the busy work of triaging support tickets, the team is freed up to spend more time actually answering questions.

Text classification models can also be used to categorize pretty much anything. You can assume that any time you post on Facebook, behind the scenes it is classifying your post into categories like “family-related” or “related to a scheduled event”:

That not only helps Facebook know which content to show to which users, but it also lets them track the topics that you are most interested in for advertising purposes.

Classification is also useful for sorting and labeling documents. Imagine that your company has done thousands of consulting projects for clients but that your boss wants them all re-organized according to a new government-mandated project coding system. Instead of reading through every project’s summary document and trying to decide which project code is the best match, you could classify a random sampling of them by hand and then build a classification model to automatically code the remaining ones:

These are just a few ideas. The uses of text classification are endless. You just have to figure out a way to reframe the problem so that the information you are trying to extract from the text can be mapped into a set of discrete output classes.

You can even build systems where one classification model feeds into another classification model. Imagine a user support system where the first classifier guesses the user’s language (English or German), the second classifier guesses which team is best suited to handle their request and a third classifier guesses whether or not the user is already upset to choose a ticket priority code. You can get as complex as you want!
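Chaining classifiers like this is just function composition. In the sketch below, each of the three functions is a hypothetical keyword-based stand-in for a separately trained classifier, so that the example runs on its own. Only the triage function, which wires them together, reflects the idea in the text.

```python
def detect_language(text):
    # Stand-in for a trained language classifier (hypothetical keyword rule).
    german_markers = {"ich", "nicht", "und", "bitte"}
    words = set(text.lower().split())
    return "German" if words & german_markers else "English"


def pick_team(text):
    # Stand-in for a trained ticket-routing classifier.
    lowered = text.lower()
    if "invoice" in lowered or "rechnung" in lowered:
        return "billing"
    return "technical"


def is_upset(text):
    # Stand-in for a trained sentiment classifier.
    return "!" in text or "unacceptable" in text.lower()


def triage(text):
    # Run the three classifiers in sequence and combine their outputs.
    return {
        "language": detect_language(text),
        "team": pick_team(text),
        "priority": "high" if is_upset(text) else "normal",
    }


print(triage("My invoice is wrong again, this is unacceptable!"))
print(triage("Ich kann mich nicht anmelden"))
```

Because each stage is independent, you can retrain or replace one classifier (say, adding a third language) without touching the others.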

Now that you are convinced of the awesomeness of dumb text classification models, let’s learn exactly how to build them!

Go to the article here:

https://kuldeepsinghsidhu.blogspot.com/2018/11/building-user-review-model-with.html

You can also find me on LinkedIn. I’d love to hear from you if I can help you or your team with machine learning.
