Sentiment analysis of Shopee layoffs using Reddit comments

Shopee pee pee pee pee pee

View source code

Disclaimer: All thoughts and opinions expressed herein are my own and do not necessarily reflect those of my employer.

Introduction

It's been a tough few months for Shopee.

The e-commerce platform has laid off hundreds of employees in that time.

While the reasons behind the layoffs remain unclear, many Reddit users are speculating about what could have led to the decision.

Can we find out how people really feel about Shopee's layoff spree from Reddit's comments?

But what does this have to do with sentiment analysis?

Well, the voting system on Reddit makes it easy to find out what most people think about any given topic.

If a certain piece of content has a large number of upvotes, it's a positive indication that people like it.

In the same way, if a lot of people vote "down" on something, it's a good sign that most people don't like it.

Of course, this isn't a perfect system, and there are always going to be outliers.

However, in general, Reddit's voting method provides a fairly accurate snapshot of how people feel about something.

So if you're looking to do some quick and dirty sentiment analysis, Reddit is a great place to start.

What is Sentiment Analysis?

Simply put, sentiment analysis is the process of computationally determining whether a piece of writing is positive, negative, or neutral.

This information can then be used to determine the general tone of a document, as well as the tone of specific passages within that document.

Consider the phrases "I'm OK with anything!" and "I'm OK with anything."

The tonal difference could result in either a slap or a kiss on the cheek when you question your visibly ravenous partner regarding dining options.

If you are constantly facing this dilemma, There's An App For That.

How is Sentiment Analysis Done?

There are a number of different techniques for performing sentiment analysis, each with its own set of benefits and drawbacks.

The most common way is to use some sort of natural language processing (NLP) in order to parse through a text and look for certain words or phrases that are indicative of positive or negative sentiment.

However, this method is often inaccurate because many words can have both positive and negative connotations depending on the context in which they are used.
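To make the word-matching idea concrete, here is a minimal sketch. The tiny word lists and the `keyword_sentiment` function are invented for illustration and are not taken from any real lexicon or library. The second example shows the context problem: "not good" still counts as positive.

```python
# Toy word lists, invented for this example (not a real sentiment lexicon).
POSITIVE_WORDS = {"good", "great", "love", "happy"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "sad"}


def keyword_sentiment(text: str) -> str:
    """Label a text by counting positive and negative keywords."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    score = positives - negatives
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(keyword_sentiment("I love this great app"))    # positive
print(keyword_sentiment("This is not good at all"))  # positive, although the sentence is negative
```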

Another way to classify the mood of a text is to use a predefined set of rules or a machine learning algorithm.

This method can be more accurate than the last one, but it takes longer and requires a larger training dataset.

A third option is to use a hybrid strategy that combines rule-based methods with machine learning.

This approach has been shown to be effective and has the advantage of only needing a small amount of training data.

There are a number of software programs that can be used for sentiment analysis. Some of these programs are free, while others are commercial.

Free software programs include:

GATE: General Architecture for Text Engineering
Apache OpenNLP: Natural Language Processing
NLTK: Natural Language Toolkit

Commercial software programs include:

IBM Watson: Artificial intelligence
Rosette: Text analytics
Alteryx: Data analytics

Why is Sentiment Analysis Important?

Sentiment analysis can help with making key decisions for a business or organisation.

For instance, if you are contemplating launching a new product, you can use sentiment analysis to gauge how people feel about similar offerings.

Or, for our test case, we could read the comments on Reddit about the recent Shopee layoffs to gather insights.

Data Preparation

First, let's import our libraries.

import asyncpraw
import json
import matplotlib.pyplot as plt
import numpy
import pandas as pd
from pandas import json_normalize

We will be using the asyncpraw library to retrieve comments from the Singapore subreddit.

You can obtain the id and secret key by registering for a Reddit app.

reddit = asyncpraw.Reddit(
 client_id='<Your-Client-ID>',
 client_secret='<Your-Client-Secret>',
 user_agent='<Your-User_Agent>'
)

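## Get comments from a list of submission ids using asyncpraw
## (top-level await works in a notebook; in a script, wrap this in an async function)
submission_list = ["x1hltt", "x2uw99", "x959w2", "vbsv76", "vd9xtw", "tq5on7", "x1hhq8"]
comment_list = []
for submission_id in submission_list:
  submission = await reddit.submission(id=submission_id)
  comments = await submission.comments()
  ## Expand every "load more comments" placeholder
  await comments.replace_more(limit=None)
  ## Seed the queue with the top-level comments
  comment_queue = comments[:]
  while comment_queue:
      comment = comment_queue.pop(0)
      comment_list.append(comment.body)
      ## Replies are queued and visited after the current level
      comment_queue.extend(comment.replies)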

## Convert to dataframe
df = pd.DataFrame({'comments': comment_list})
df['id'] = df.index

id	comments
...	...
488	Using GPA as the primary deciding factor. Hear...
489	Are you referring to tech specifically or acro..
490	>Using GPA as the primary deciding factor.\n\n...
491	I'm referring to tech. FAANG, HFTs, some other...
492	I got into FAANG and HFT without a FCH. So yea...
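The loop that collected the comments is a breadth-first walk over each comment tree. The sketch below shows the same queue pattern without touching the Reddit API. `FakeComment` and `flatten_comments` are hypothetical stand-ins written for this example, not part of asyncpraw.

```python
from dataclasses import dataclass, field


@dataclass
class FakeComment:
    """Stand-in for a Reddit comment: a body plus nested replies."""
    body: str
    replies: list = field(default_factory=list)


def flatten_comments(top_level):
    """Collect every comment body, visiting the tree level by level."""
    bodies = []
    queue = list(top_level)  # seed the queue with the top-level comments
    while queue:
        comment = queue.pop(0)         # take the oldest queued comment
        bodies.append(comment.body)
        queue.extend(comment.replies)  # replies wait behind everything already queued
    return bodies


thread = [
    FakeComment("A", [FakeComment("A1", [FakeComment("A1a")]), FakeComment("A2")]),
    FakeComment("B", [FakeComment("B1")]),
]
print(flatten_comments(thread))  # ['A', 'B', 'A1', 'A2', 'B1', 'A1a']
```

For very large threads, `collections.deque` with `popleft()` would avoid the cost of popping from the front of a list.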

Data Cleaning

Data cleaning is an essential part of data preparation, which can make or break the success of a data analysis project.

We identified a few characters in the comments that could cause problems for us later.

df['comments'] = df['comments'].map(
    lambda x: x
    .replace('##', '')
    .replace('**', '')
    .replace('\\', '')
    .replace(">", '')
    .replace("...", '')
    .replace("\n", ' ')
    .replace("[deleted]", '')
    )

id	comments
...	...
488	Using GPA as the primary deciding factor. Hear...
489	Are you referring to tech specifically or acro..
490	Using GPA as the primary deciding factor. Sorry...
491	I'm referring to tech. FAANG, HFTs, some other...
492	I got into FAANG and HFT without a FCH. So yea...
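If you prefer a named function over a long chain of `replace` calls, the cleaning step could be sketched like this. `clean_comment` is a hypothetical helper and the sample string is made up. The final whitespace collapse is an extra step that is not in the chain above.

```python
def clean_comment(text: str) -> str:
    """Strip Markdown markers and placeholders, then tidy whitespace."""
    for token in ("##", "**", "\\", ">", "...", "[deleted]"):
        text = text.replace(token, "")
    # Extra step, not in the original chain: collapse runs of whitespace
    # (including newlines) into single spaces.
    return " ".join(text.split())


sample = ">Using GPA as the primary deciding factor.\n\n**Sorry**, I disagree..."
print(clean_comment(sample))
# Using GPA as the primary deciding factor. Sorry, I disagree
```

It can be applied to the whole column with `df['comments'].map(clean_comment)`.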

On to the fun stuff!

We will be using three different methods for sentiment analysis.

They are VADER (Valence Aware Dictionary and sEntiment Reasoner), the RoBERTa Transformers model, and Transformers pipelines.

VADER (Valence Aware Dictionary and sEntiment Reasoner)

VADER is a lexicon- and rule-based approach to sentiment analysis: it uses a dictionary of words, each rated for how positive or negative it is, to predict the polarity of a text, which can be positive, negative, or neutral.

The rules for scoring a text combine the ratings of the words it contains with their context, such as punctuation, capitalisation, and intensifiers like "very".

For example, a negation such as "not" or "never" flips the polarity of the words that follow it.
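Here is a toy sketch of how a rule-based scorer can handle negation and punctuation. It is not VADER's actual algorithm: the lexicon, the negation list, and the 0.3 boost per exclamation mark are all invented for illustration.

```python
# Toy valence lexicon, invented for this example (VADER's real lexicon is far larger).
LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "terrible": -3.0}
NEGATIONS = {"not", "never", "no"}


def toy_rule_score(text: str) -> float:
    """Sum word valences, flipping the next scored word after a negation."""
    score = 0.0
    negate = False
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in NEGATIONS:
            negate = True
            continue
        valence = LEXICON.get(word, 0.0)
        if valence:
            score += -valence if negate else valence
            negate = False  # the negation has been used up
    # Each exclamation mark pushes the score further from zero.
    boost = 0.3 * text.count("!")
    if score > 0:
        score += boost
    elif score < 0:
        score -= boost
    return score


print(toy_rule_score("The app is good"))      # 2.0
print(toy_rule_score("The app is not good"))  # -2.0
print(toy_rule_score("The app is good!!"))    # 2.6
```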

VADER has been found to work well on short, informal text, even when the emotions in it are mixed.

It can be used on social media posts, product reviews, and other user-generated content to quickly get a sense of the sentiment of the text.

However, the downside is that VADER can be fooled by sarcasm and irony, and it does not work well with longer texts.

Additionally, because it relies on a fixed, hand-curated lexicon rather than a model you can retrain, it is not as customisable as some other sentiment analysis tools.

VADER Sentiment Analysis explained

Let's import the NLTK library and the required modules.

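import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

## The analyzer needs the VADER lexicon, which only has to be downloaded once
nltk.download('vader_lexicon')
sentiment_intensity_analyzer = SentimentIntensityAnalyzer()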

Next, we are going to calculate the polarity scores of the comments using NLTK's SentimentIntensityAnalyzer class.

Its polarity_scores method accepts a string as input and returns a dictionary containing scores for each of the following four categories: positive, negative, neutral, and compound.

res = {}
for i, row in df.iterrows():
  comments = row['comments']
  comment_id = row['id']
  ## Use the analyzer instance created above
  res[comment_id] = sentiment_intensity_analyzer.polarity_scores(comments)

id	neg	neu	pos	compound	comments
...	...	...	...	...	...
488	0.252	0.674	0.074	-0.7184	Using GPA as the primary deciding factor. Hear...
489	0.0	1.0	0.0	0.0	Are you referring to tech specifically or acro...
490	0.073	0.813	0.114	0.5499	Using GPA as the primary deciding factor. Sorry...
491	0.0	0.882	0.118	0.5267	I'm referring to tech. FAANG, HFTs, some other...
492	0.0	1.0	0.0	0.0	I got into FAANG and HFT without a FCH. So yea...

sentiments	comments
...	...
Neutral	Incorporated in Singapore, headquartered in Singapore, helmed by Singaporean (technically) but in truth a China company to the bone.
Neutral	Not really surprised looking at their stock price
Negative	Sucky company btw
Negative	fuck
Negative	Basically petty political bullshit
Positive	Well put.
Positive	Indeed, friend.
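The step that turns a compound score into a Positive, Negative, or Neutral label is not shown above. A minimal sketch, assuming the ±0.05 cut-offs conventionally suggested for VADER (the thresholds actually used for the table may differ), could look like this. `label_from_compound` is a hypothetical helper name.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Compound scores taken from rows 488 to 490 of the polarity table above.
for compound in (-0.7184, 0.0, 0.5499):
    print(compound, label_from_compound(compound))
# -0.7184 Negative
# 0.0 Neutral
# 0.5499 Positive
```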

Results were mixed but acceptable.

Vader Pie Chart

RoBERTa Transformers

In recent years, rule-based systems and other conventional approaches to sentiment analysis have given way to more complex methods that employ deep learning techniques.

The RoBERTa Transformers model is an example of such a strategy.

It is a transformer-based model that was pre-trained on a large corpus of text.

The model can be fine-tuned for a variety of tasks, including sentiment analysis.

There are numerous benefits to employing the RoBERTa model for sentiment analysis.

For one thing, the model is very accurate.

RoBERTa has been reported to outperform other models, such as the well-known BERT model, on several standard NLP benchmarks.

Also, fine-tuned RoBERTa checkpoints are freely available, so you can often skip training altogether, which can save time and money.

The RoBERTa Transformers model does, however, have several drawbacks.

It takes a lot of computing power, which can make jobs like sentiment analysis very slow.

Additionally, the RoBERTa Transformers model is not easily interpretable.

It is difficult to understand why the RoBERTa Transformers model gives certain results, which can be a problem when trying to improve the model.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

## turns numbers into probabilities which sum to one
from scipy.special import softmax

## text classification model that classifies text to emotions.
MODEL = "j-hartmann/emotion-english-distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def return_polarity_scores(text):
  """
    A custom function to compute polarity scores
    Returns: anger, disgust, fear, joy, neutral, sadness, surprise
  """
  encoded_text = tokenizer(text, return_tensors='pt')
  output = model(**encoded_text)
  ## output[0] holds the logits; take the row for our single input
  scores = output[0][0].detach().numpy()
  scores = softmax(scores)
  scores_dict = {
      'anger': scores[0],
      'disgust': scores[1],
      'fear': scores[2],
      'joy': scores[3],
      'neutral': scores[4],
      'sadness': scores[5],
      'surprise': scores[6],
  }
  return scores_dict

res = {}
for i, row in df.iterrows():
  try:
    comments = row['comments']
    comment_id = row['id']
    res[comment_id] = return_polarity_scores(comments)
  except RuntimeError:
    ## Typically raised when a comment is longer than the model's maximum input length
    print(f'Error on row {i}')

id	anger	disgust	fear	joy	neutral	sadness	surprise	comments
...	...	...	...	...	...	...	...	...
488	0.4298040270805359	0.4912334978580475	0.004193916451185942	0.001422425382770598	0.05335167795419693	0.0170204546302557	0.002974043134599924	Using GPA as the primary deciding factor. Hear...
489	0.01266899611800909	0.00997665710747242	0.0038051940500736237	0.002049379050731659	0.8720080256462097	0.003747645765542984	0.09574423730373383	Are you referring to tech specifically or acro...
490	0.029277583584189415	0.04661332443356514	0.004849941004067659	0.004099743440747261	0.5986599922180176	0.2579329013824463	0.058566685765981674	Using GPA as the primary deciding factor. Sorry...
491	0.0064127808436751366	0.007289603352546692	0.0016105140093713999	0.004326922353357077	0.8848540782928467	0.00819186121225357	0.08731438219547272	I'm referring to tech. FAANG, HFTs, some other...
492	0.1379590630531311	0.15500691533088684	0.013789849355816841	0.006102893501520157	0.3878730833530426	0.01567666418850422	0.2835916578769684	I got into FAANG and HFT without a FCH. So yea...

sentiments	comments
Anger	they get what they deserve.
Disgust	Shopee is an ass of a company. Comfirm they engange in slaving their employees away with the '996' culture.And yet I still buy from them.
Fear	The have the "power" to backlist him among the tech firm in China. It's something HR in China often threaten because apparently they know each other. Somehow. Guanxi you know.
Joy	I love the shoPEE notification!!!!
Neutral	Anyone know how to get in touch with those who were affected by Shopee's decisions?
Sadness	Sigh hard times for big tech :/
Surprise	Is shopee Chinese dominated? How come they can withdraw the offer ie it should have already been accepted before the guy was to start on 29th right?
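Two small steps sit behind these tables: softmax turns the model's raw scores into probabilities, and each comment is then presumably labelled with its most probable emotion. The sketch below shows both in plain Python. The `softmax` here is a hand-written stand-in for `scipy.special.softmax`, `top_emotion` is a hypothetical helper, and picking the highest probability is an assumption about how the labels were assigned.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_emotion(scores: dict) -> str:
    """Return the emotion with the highest probability."""
    return max(scores, key=scores.get)


probabilities = softmax([2.0, 1.0, 0.1])
print(round(sum(probabilities), 6))  # 1.0

# Probabilities for row 488, rounded from the table above.
row_488 = {
    "anger": 0.4298, "disgust": 0.4912, "fear": 0.0042, "joy": 0.0014,
    "neutral": 0.0534, "sadness": 0.0170, "surprise": 0.0030,
}
print(top_emotion(row_488))  # disgust
```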

Pretty impressive results.

Roberta Pie Chart

HuggingFace Transformers

If you're not familiar with HuggingFace's Transformers library, you might be wondering what all the fuss is about.

Transformers is a natural language processing (NLP) library that has recently become well-known for producing state-of-the-art results in a variety of NLP tasks.

Transformer models use a mechanism called self-attention to learn relationships between words in a sentence (or any sequence of data), which lets them produce highly accurate results on a range of tasks.

Unlike recurrent networks, which read a sentence one word at a time, transformer networks can learn associations between words that are far apart.

This is because they take the entire sequence of words into account when learning relationships.
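As a rough illustration of how every word can look at every other word, here is a stripped-down version of scaled dot-product self-attention in plain Python. It is a simplification: real transformers use learned query, key, and value projections and many attention heads, while this sketch uses the word vectors directly. The two-dimensional "embeddings" are made up.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    weights, outputs = [], []
    for query in vectors:
        # Compare this word with every word in the sequence, near or far.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # The output is a weighted average of all word vectors.
        outputs.append(
            [sum(w * vec[i] for w, vec in zip(row, vectors)) for i in range(dim)]
        )
    return weights, outputs


# Made-up vectors for a four-word sentence; the first and last words are similar.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
weights, _ = self_attention(embeddings)
print([round(w, 3) for w in weights[0]])  # [0.335, 0.165, 0.165, 0.335]
```

The first word attends to the last word as strongly as to itself, even though they sit at opposite ends of the sequence.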

While HuggingFace transformers provide numerous benefits for sentiment analysis, there are certain drawbacks to consider.

One such drawback is that a transformer can overfit its training data.

This means that it might not generalise well to new data and could mispredict the tone of new documents.

In addition, the transformer may be slow to train and necessitate a substantial amount of training data.

Finally, the transformer may be difficult to interpret and may not provide clear explanations of the predictions it makes.

from transformers import pipeline

## This is a RoBERTa-base model trained on ~124M tweets from January 2018 to December 2021,
## and fine-tuned for sentiment analysis with the TweetEval benchmark.
## https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"
sent_pipeline = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)

res = {}
for i, row in df.iterrows():
  try:
    comments = row['comments']
    comment_id = row['id']
    res[comment_id] = sent_pipeline(comments)
  except RuntimeError:
    ## Typically raised when a comment is longer than the model's maximum input length
    print(f'Error on row {i}')

id	label	score	comments
...	...	...	...
486	Negative	0.8902063965797424	still good la hahaaameans shopee really tighte...belts
487	Neutral	0.9093933701515198	The way it works is that they get an IPA (in-p...
488	Negative	0.6428775787353516	Using GPA as the primary deciding factor. Hear...
489	Neutral	0.8256815075874329	Are you referring to tech specifically or acro...
490	Neutral	0.8007738590240479	Using GPA as the primary deciding factor.Sorry...
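For a single string, the pipeline returns a list containing one dictionary with a `label` and a `score`. The sketch below flattens results of that shape into rows like the ones in the table. The `res` dictionary here is a hand-written stand-in with scores rounded from the table, and the lowercase label casing is an assumption about the model's raw output.

```python
# Stand-in for `res`: each value mimics the pipeline's return value,
# a list with one dict per input text. Scores are rounded from the table above.
res = {
    486: [{"label": "negative", "score": 0.8902}],
    487: [{"label": "neutral", "score": 0.9094}],
}

rows = [
    {
        "id": comment_id,
        "label": output[0]["label"].capitalize(),  # tidy the label for display
        "score": output[0]["score"],
    }
    for comment_id, output in res.items()
]
print(rows[0])  # {'id': 486, 'label': 'Negative', 'score': 0.8902}
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame` and then merged with the comments on `id`.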

sentiments	comments
...	...
Negative	Ultimate douchebags, they think they're on the same league as FAAANG and can play this kinda stun I would stay the fk away unless you have an addiction to pain and suffering, cus even if u were to work for them, it's 996
Negative	Shopee is a company which I will never consider joining, regardless of how "good" the compensation can be if there is any to be left with.They don't deserve our local tech talents. Not now, not ever.
Negative	just how disorganized is this company? lol
Neutral	I work in FAANG, everyone in the software industry in Singapore knows that the first company to avoid is Shopee.
Neutral	70% of their workforce is at Guangzhou anyway
Positive	I interviewed with Ian Ho MD of Shopee Singapore. He is a stuck up cunt.
Positive	Shoppeee doesn't care? Everyone knows is Shiat, but cheap? The same applies to China, for example. 🤷🏾PR _really_ doesn't matter when you're cheap as fuck
Positive	show pee pee pee pee peeeeeeeeeeee
Positive	Back in 2014, when tech was not so exciting, their remuneration for Tech roles was some of the best as a graduate hireRight now, I understand it is like 10% over last drawn.

Mixed bag of results.

Transformer Pie Chart

Cloudy with a Chance of Meatballs

A word cloud is a visual representation of a text in which each word is sized according to how often it appears in the source.

Word clouds can be a fun and easy way to visualize text data.
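Under the hood, the weight of each word is essentially a frequency count with the stop words left out. A minimal sketch of that counting step follows. The stop word list and the sample comments are made up for the example, and `word_weights` is a hypothetical helper, not part of the wordcloud library.

```python
from collections import Counter

# Tiny hand-written stop word list for the example (NLTK's English list is much longer).
STOP_WORDS = {"the", "is", "a", "to", "and", "of"}


def word_weights(texts):
    """Count how often each non-stop word appears across all texts."""
    counts = Counter()
    for text in texts:
        for raw in text.lower().split():
            word = raw.strip(".,!?\"'")
            if word and word not in STOP_WORDS:
                counts[word] += 1
    return counts


comments = [
    "Shopee is a company",
    "The company is hiring",
    "Hiring freeze at the company",
]
print(word_weights(comments).most_common(2))  # [('company', 3), ('hiring', 2)]
```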

Before we generate our wordcloud, we are going to lower-case the comments, join them into a single string, and use the NLTK library's English stop word list so that those words are left out.

from nltk.corpus import stopwords
from wordcloud import WordCloud

nltk.download('stopwords')
en_stops = stopwords.words('english')

## pipeline_df is presumably the dataframe holding the pipeline results and the comments
all_comments_text = ",".join(str(item).lower() for item in pipeline_df['comments'])

wordcloud_all = WordCloud(
    stopwords=en_stops,
    background_color='black',
    max_words=200
).generate(all_comments_text)

Wordcloud

Conclusion

One of the main problems with sentiment analysis is that it can be hard to tell when someone is being sarcastic or ironic.

This is because the algorithms that are used to evaluate text data are often not sophisticated enough to pick up on these kinds of nuances.

As a result, sentiment analysis can sometimes produce inaccurate results.

Another challenge for sentiment analysis is slang and colloquialisms.

They are often used in social media posts, which can make sentiment analysis difficult.

For instance, while sentiment analysis is possible in Singlish, it is limited since the syntax differs from that of standard English.

"No Lah!" and "No Lah." have two different meanings.

Additionally, there are several loanwords from other languages, which can complicate sentiment analysis.

Despite these limitations, sentiment analysis is still a valuable tool for making business decisions.
