Machines’ key to understanding humans: how I used natural language processing to analyze human sentiment and draw conclusions.

Adam Maj
10 min read · Dec 22, 2019


Today, there are an estimated 20 zettabytes of data available online. If you have absolutely no idea what a zettabyte is (just like me when I first read that stat), allow me to explain to you how huge that figure is.

Why don’t we start out by understanding what a byte is? The byte is one of the smallest measures of data. A single byte consists of one character and nothing more. The letter ‘a’ or the number ‘2’ or any other such character is a byte.

If we zoom out by a factor of 1000, we are now at a kilobyte. A kilobyte, or 1000 bytes, is the equivalent of an average page of text (which has around 1000 characters). If we zoom out another 1000x, we are at a megabyte, which holds about as much data as a book with 1000 pages.

Zooming out another 1000x brings you to a gigabyte, which is approximately the amount of data necessary to store the human genome. Yet another 1000x will bring you to a terabyte, consisting of an amount of data equivalent to what you would get if you filmed and stored every moment of your entire life. Bear with me here, because we've still got a few more to go.

If we zoom out another 1000x, we get to a petabyte of data. That’s about how much data you would get if you cut down every single one of the 390 billion trees in the Amazon rainforest (I’m not saying you should try this 😬), turned every tree into paper, and then covered both sides of every single sheet with text. That’s crazy. But wait, there’s more!

Now if you take 1000 of those Amazon rainforests full of text, you would have an exabyte. I can’t even imagine how large that is. Finally, if you took 1000 of those exabytes, you would have one zettabyte. And remember, there are twenty of those on the internet right now. That’s how much data is online: it’s an unimaginable amount.
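To make the scale concrete, here is a minimal sketch of the arithmetic above. It assumes the decimal convention used in this article, where each unit is 1000x the previous one; the `UNITS` list and `bytes_in` helper are names I made up for illustration.

```python
# Decimal data units, each 1000x larger than the one before it.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]

def bytes_in(unit: str) -> int:
    """Return how many bytes make up one of the given unit."""
    return 1000 ** UNITS.index(unit)

# The article's estimate of 20 zettabytes, expressed in bytes.
online_bytes = 20 * bytes_in("zettabyte")
print(f"{online_bytes:.1e} bytes")  # prints 2.0e+22 bytes
```

That is a 2 followed by 22 zeros, which is why nobody is going to read through it by hand.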

Humans could never even hope to make use of this insane amount of data available to us. Luckily, we have machines to do just that. Over the past few decades, we have been developing technologies like machine learning to make use of this data to make predictions and solve problems.

However, traditional machine learning strategies do have their limits. They are predominantly effective at analyzing and crunching highly structured data. When our data is in the form of numbers, it's quite feasible to structure it into a format that machines can easily work with. There’s just one problem here: so much of the data online is in the form of written text, not numbers (like this article, or anything else you’ve read online, and the millions of things you haven’t read online).

Now you may be wondering: Is there a way to build on traditional machine learning strategies to enable machines to analyze text? You may be happy to learn that the answer to that question is yes.

How can machines make sense of text and why should we want them to?

We can use a body of knowledge called natural language processing (NLP) to process verbal data into a format that machines can make use of. In many cases, we can also use a strategy called sentiment analysis to analyze this verbal data and draw conclusions from it. Sentiment analysis is the process of extracting subjective information from the text to gather data and identify the emotion and attitude behind a piece of writing.

It's important to note that natural language processing is a process that we can apply to any data in the form of text in order to prepare it properly. Once this data is prepared, it is in a format that machine learning algorithms can use to derive conclusions from it. With this joint use of NLP and ML, we can enable machines to make sense of human language.

The incredible thing about enabling machines to understand language is that the implications are much larger than just creating another glorified math equation to analyze data. The thing about language and text is that a large portion of all human knowledge is contained in some form of text and is available online. Thus, enabling machines to understand language is analogous to granting them access to the knowledge that humanity has gathered over hundreds of thousands of years. This clearly has immense potential.

Using natural language processing and machine learning to analyze customer sentiment.

Alright, now that we know what natural language processing is and why it's so useful, why don’t we take a look at a specific example where it can be put to good use?

Customer reviews are one of the places where NLP and sentiment analysis can be used most effectively. There are countless collections of reviews and ratings online about different places, services, and even people. Reviews can often be viewed as constructive criticism that can be used to improve a product.

Take hotels for example. There are hundreds or even thousands of reviews online for each hotel where customers explain what they liked and didn’t like about the hotels. It would be useful for these hotels if they could gather all of the valuable information from these reviews to improve their amenities and services.

Unfortunately for the hotels, there is no efficient way to do this manually. Sitting and reading through hundreds of reviews is impractical. Now imagine how impractical it would be to read through reviews for more popular spots with thousands or even tens of thousands of reviews.

An example of how natural language processing can be used to help analyze important information in reviews and improve customer experience.

Luckily, we can use the intersection between natural language processing and machine learning to analyze such databases of reviews and draw conclusions in a matter of seconds!

We can use these algorithms to draw useful conclusions from reviews that can help to improve the product being reviewed. For example, we can use the combination of natural language processing and machine learning to figure out what people most commonly like and dislike about something. Picking out useful pieces of information like this is clearly valuable.

Let's look at a case where we use both natural language processing and machine learning to analyze customer sentiment in reviews. We can take a look at the process step by step to fully understand what’s actually going on.

Building the algorithm with NLP and ML

Let’s say we have a database of reviews about a product, each consisting of a written review and a corresponding rating. We can use this data to build a useful algorithm. In our example, we’ll say that each review looks something like this:

Rating: 5/5
Review: I loved this product. I loved how easy it was to use and how convenient it was to bring it around. Had great usability. Would recommend it to anyone!

Currently, this is not in a format that is understandable to machines, so we will have to turn it into the desired format. Let’s dive right into the process!
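Before any language processing happens, it helps to load each review into a small structure the rest of the code can work with. This is only a sketch: the `Review` class and `parse_review` function are hypothetical names, and the parser assumes every review follows the exact "Rating: x/y" then "Review: ..." layout shown above.

```python
import re
from dataclasses import dataclass

@dataclass
class Review:
    rating: int      # score the customer gave
    max_rating: int  # highest possible score
    text: str        # the written review

def parse_review(raw: str) -> Review:
    """Parse a 'Rating: x/y' line followed by a 'Review: ...' line."""
    pattern = r"Rating:\s*(\d+)\s*/\s*(\d+)\s*Review:\s*(.*)"
    match = re.search(pattern, raw, re.DOTALL)
    if match is None:
        raise ValueError("unrecognized review format")
    rating, max_rating, text = match.groups()
    return Review(int(rating), int(max_rating), text.strip())

raw = """Rating: 5/5
Review: I loved this product. Would recommend it to anyone!"""

review = parse_review(raw)
print(review.rating, review.max_rating)  # prints 5 5
print(review.text)
```

Keeping the rating next to the text matters later on, because the rating is what tells a learning algorithm whether the words in a review belong to a happy or an unhappy customer.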

1. Tokenization

The first step to processing our text is called tokenization. This is the process of splitting up the paragraphs and sentences in our review into individual words. This allows us to look at each word on its own to analyze how each word connects to the meaning of an entire block of text.

In its most basic form, tokenization is actually quite simple. The algorithm looks for characters that separate words. The most obvious of these is the space, which occurs between almost every pair of words. Tokenization also separates words from the punctuation around them. Some cases are a bit more complex (for example, should the two parts of “Mr. Brown” be split up, since there is technically a period between them?), but overall the process is simple.
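Here is a minimal sketch of that idea using only Python's standard library. It is a simplified stand-in for a real tokenizer (such as the one in NLTK), and the `ABBREVIATIONS` set is a short illustrative list I chose, not an exhaustive one.

```python
import re

# Abbreviations whose period should stay attached (illustrative, not exhaustive).
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr."}

def tokenize(text: str) -> list[str]:
    """Split text into words and punctuation marks."""
    tokens = []
    for chunk in text.split():  # first split on whitespace
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)  # keep "Mr." together
            continue
        # Then peel punctuation away from words, keeping apostrophes
        # inside words such as "didn't".
        tokens.extend(re.findall(r"\w+(?:['’]\w+)*|[^\w\s]", chunk))
    return tokens

print(tokenize("Mr. Brown loved it. Would recommend!"))
# prints ['Mr.', 'Brown', 'loved', 'it', '.', 'Would', 'recommend', '!']
```

Notice that the period after "it" becomes its own token while the one in "Mr." stays attached, which is exactly the kind of special case mentioned above.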

2. Filtering Stopwords

The next step to our natural language processing involves filtering stopwords out of our reviews. Stopwords are words that add little to nothing to the meaning of statements and thus, are irrelevant from an analysis standpoint.

Here is a list of some of the most common stopwords in English

For example, words like “the” or “a” help sentences to flow better and make things sound nicer to readers. However, they add nothing to the meaning of reviews. A review that has the word “the” in it is no more likely to be a good or bad review than any other. Stopwords are not correlated with the sentiment of the phrases that they are in and thus can be removed.

This can be accomplished quite easily by removing all the words in each review that are found in the list of the most common stopwords in the English language.
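As a sketch, the filtering step is a single lookup per word. The `STOPWORDS` set below is a tiny hand-picked sample for illustration; real stopword lists (for example, the one that ships with NLTK) are much longer.

```python
# A small sample of English stopwords (illustrative only).
STOPWORDS = {"the", "a", "an", "and", "it", "to", "was", "is",
             "i", "how", "this", "of", "would"}

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword list."""
    return [token for token in tokens if token.lower() not in STOPWORDS]

tokens = ["I", "loved", "how", "easy", "it", "was", "to", "use"]
print(remove_stopwords(tokens))  # prints ['loved', 'easy', 'use']
```

The comparison is done in lowercase so that "I" and "The" at the start of a sentence are caught too.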

3. Stemming & Lemmatization

Stemming and lemmatization are optional in many cases, but they often help to improve the quality of the analysis. Both reduce a word to a simpler base form, which helps to associate similar words with each other. Stemming does this by stripping suffixes to reach a rough root, while lemmatization maps each word to its dictionary form, or lemma.

For example, the words “consult,” “consultant,” and “consulting” all have similar meanings. Our algorithm can tell this based on the fact that they all have the same root word. Thus, we can reduce all of these words down to their root “consult” to prevent our algorithm from getting confused into thinking that all of these are different concepts.
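To show the idea, here is a deliberately crude suffix-stripping stemmer. It is not the Porter algorithm or any other real stemmer, and the `SUFFIXES` tuple is a short list I picked so the example above works; real stemmers apply many more rules.

```python
# Suffixes to strip, checked in this order (illustrative only).
SUFFIXES = ("ing", "ant", "ed", "s")

def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["consult", "consultant", "consulting"]:
    print(word, "->", crude_stem(word))
# each of the three lines ends in "consult"
```

A rule set this small will also mangle plenty of words (it turns "loved" into "lov", for instance), which is fine for grouping related words but is one reason lemmatization is sometimes preferred.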

With all of this natural language processing complete, we can finally move on to applying machine learning to our data to draw conclusions!

4. Applying Machine Learning

Depending on what kinds of results we are trying to derive from our dataset, there are a variety of machine learning algorithms that we could use. One common choice in sentiment analysis is the Naive Bayes algorithm, which learns how strongly each word is associated with good or bad reviews and uses those associations to classify new reviews. Numerous other algorithms can be used for the same purpose.
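Here is a minimal from-scratch sketch of a Naive Bayes classifier with add-one smoothing, so you can see how word counts turn into a prediction. The `NaiveBayes` class and the four tiny training reviews are made up for illustration; a real project would train on far more data and would likely use an existing library.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of reviews
        self.vocab = set()                       # every word seen in training

    def train(self, tokens, label):
        self.label_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        total_reviews = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_reviews in self.label_counts.items():
            # Start from the log prior: how common this label is overall.
            score = math.log(n_reviews / total_reviews)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                # Add-one smoothing so unseen words don't zero out the score.
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

model = NaiveBayes()
model.train(["loved", "easy", "convenient"], "good")
model.train(["great", "recommend"], "good")
model.train(["broke", "disappointed"], "bad")
model.train(["hard", "use", "refund"], "bad")

print(model.predict(["easy", "great"]))  # prints good
```

The "naive" part is that each word is treated as independent of the others, which is why the per-word probabilities can simply be multiplied (or, as here, their logs added).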

How can YOU use NLP and sentiment analysis to your advantage?

Now that you know the basics of how natural language processing and sentiment analysis work, why don’t we talk about how you can actually implement these technologies for your own uses? Armed with these tools, you will be able to make the most of the data online to help you in your own endeavors.

There are currently a number of sentiment analysis tools available online that serve a variety of purposes, from analyzing reviews to gathering social media sentiment. If you want to learn more about these tools and which one is the best for you to use, this article is a great place to start.

If you suspect that the technologies and tools that are already on the market are not suitable for you, there are a number of other options available. Most notably, you can create your own sentiment analysis algorithm from scratch and tailor it to your own needs. The most common way to do this is to use the programming language Python as there are already a number of tools for you to use with the language that will help you.

You can use web scraping tools like Scrapy to gather useful text data from different online sources in order to make a database for your sentiment analysis. Then, you can use Python tools like the Natural Language Toolkit (NLTK) and VADER to actually analyze the data. By making use of any of these tools and processes, you can fully utilize the massive amount of data available to you today.
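If you want a feel for how a lexicon-based tool works before installing anything, here is a toy sketch. VADER is a lexicon- and rule-based analyzer, but this is not VADER: the `LEXICON` scores and the single negation rule below are invented for illustration and are far simpler than what a real tool does.

```python
# Made-up sentiment scores for a handful of words (illustrative only).
LEXICON = {"loved": 3.0, "great": 3.0, "easy": 2.0, "convenient": 2.0,
           "bad": -2.5, "broken": -2.0, "terrible": -3.0}
NEGATIONS = {"not", "never", "no"}

def sentiment_score(tokens):
    """Sum word scores, flipping the sign of a word that follows a negation."""
    total = 0.0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token.lower(), 0.0)
        if i > 0 and tokens[i - 1].lower() in NEGATIONS:
            value = -value
        total += value
    return total

print(sentiment_score(["loved", "how", "easy", "it", "was"]))  # prints 5.0
print(sentiment_score(["not", "great"]))                       # prints -3.0
```

A positive total suggests a happy review and a negative total an unhappy one. Unlike the machine learning approach, this needs no training data, but it only knows the words you put in the lexicon.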

Check out my project!

Recently, I used all of these concepts to create a sentiment analysis algorithm to help out Barclays Center in Brooklyn. I used my algorithm to analyze customer reviews to figure out what people most liked and disliked about Barclays Center. Make sure to check out my project below where I explain exactly how I made everything and how it works!

Learn about my sentiment analysis algorithm here!

As you can hopefully see, natural language processing has a lot of potential to help us analyze and make use of verbal data. Furthermore, it has even greater potential to make an impact and solve larger problems in the future!

Key Takeaways

  • There is so much data online and so much of that data is in the form of text. As a result, we need an effective way to process and make use of this text data.
  • Natural language processing can be used in parallel with machine learning to draw conclusions from verbal data.
  • We can use NLP processes like tokenization, stopword filtering, stemming, and lemmatization to prepare verbal data for analysis by traditional machine learning algorithms.
  • There are a number of different tools that you can use to create your own sentiment analysis algorithm or make use of previously existing ones.
  • The combination of both NLP and ML has immense potential to help us today and even greater potential to make an impact on the future.

Wait… don’t click away yet!

I’m Adam, a 16-year-old passionate about technologies like artificial intelligence/machine learning, blockchain, quantum computing, and much more.
