Disclaimer: All thoughts and opinions expressed herein are my own and do not necessarily reflect those of my employer.
It's been a tough few months for Shopee.
The e-commerce platform has laid off hundreds of employees during that time.
While the reasons behind the layoffs remain unclear, many Reddit users have been speculating about what led to the decision.
Can we find out how people really feel about Shopee's layoff spree from Reddit's comments?
Well, the voting system on Reddit makes it easy to find out what most people think about any given topic.
If a certain piece of content has a large number of upvotes, it's a good indication that people like it or agree with it.
In the same way, if a lot of people downvote something, it's a good sign that most people don't like it.
Of course, this isn't a perfect system, and there are always going to be outliers.
However, in general, Reddit's voting method provides a fairly accurate snapshot of how people feel about something.
So if you're looking to do some quick and dirty sentiment analysis, Reddit is a great place to start.
Simply put, sentiment analysis is the process of computationally determining whether a piece of writing is positive, negative, or neutral.
This information can then be used to determine the general tone of a document, as well as the tone of specific passages within that document.
Consider the phrases "I'm OK with anything!" and "I'm OK with anything."
The difference in tone could earn you either a slap or a kiss on the cheek when you ask your visibly ravenous partner about dining options.
If you constantly face this dilemma, there's an app for that.
There are a number of different techniques for performing sentiment analysis, each with its own set of benefits and drawbacks.
The most common way is to use a lexicon-based form of natural language processing (NLP): the text is parsed for words or phrases that indicate positive or negative sentiment.
However, this method is often inaccurate because many words can have both positive and negative connotations depending on the context in which they are used.
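To see why simple word-spotting struggles with context, here is a minimal sketch of a lexicon scorer. The lexicon, its weights, and the function name `naive_lexicon_score` are made up for illustration and do not come from any real library.

```python
import re

# Toy lexicon: these words and weights are invented for illustration only.
TOY_LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "terrible": -1, "sick": -1}

def naive_lexicon_score(text):
    """Sum the weights of every lexicon word found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(TOY_LEXICON.get(word, 0) for word in words)

print(naive_lexicon_score("The food was great"))         # 1: positive, as expected
print(naive_lexicon_score("The food was not good"))      # 1: the negation is ignored
print(naive_lexicon_score("That guitar solo was sick"))  # -1: the slang praise is missed
```

The last two sentences are scored the wrong way round, which is exactly the context problem described above.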
Another way to classify the mood of a text is to train a machine learning algorithm on examples that have already been labelled.
This method can be more accurate than the last one, but it takes longer and requires a larger training dataset.
A third option is to use a hybrid strategy that combines predefined rules with machine learning.
This approach can be effective and has the advantage of needing only a small amount of training data.
There are a number of software programs that can be used for sentiment analysis. Some of these programs are free, while others are commercial.
Free software programs include:
Commercial software programs include:
Why bother with sentiment analysis? Because it can help with making key decisions for a business or organisation.
For instance, if you are contemplating launching a new product, you can use sentiment analysis to gauge how people feel about similar offerings.
Or, for our test case, we could read the comments on Reddit about the recent Shopee layoffs to gather insights.
First, let's import our libraries.
import asyncpraw
import json
import matplotlib.pyplot as plt
import numpy
import pandas as pd
from pandas import json_normalize
We will be using the asyncpraw library to retrieve comments from the Singapore subreddit.
You can obtain the client ID and client secret by registering an app with Reddit.
reddit = asyncpraw.Reddit(
    client_id='<Your-Client-ID>',
    client_secret='<Your-Client-Secret>',
    user_agent='<Your-User_Agent>'
)
## Get comments from a list of submission ids using asyncpraw
## (top-level await works in a notebook; in a script, wrap this in an async function)
submission_list = ["x1hltt", "x2uw99", "x959w2", "vbsv76", "vd9xtw", "tq5on7", "x1hhq8"]
comment_list = []
for submission_id in submission_list:
    submission = await reddit.submission(id=submission_id)
    comments = await submission.comments()
    await comments.replace_more(limit=None)
    ## Walk the comment tree breadth-first, collecting every comment body
    comment_queue = comments[:]
    while comment_queue:
        comment = comment_queue.pop(0)
        comment_list.append(comment.body)
        comment_queue.extend(comment.replies)
## Convert to dataframe
df = pd.DataFrame(comment_list, columns=['comments'])
df['id'] = df.index
id | comments |
---|---|
... | ... |
488 | Using GPA as the primary deciding factor. Hear... |
489 | Are you referring to tech specifically or acro.. |
490 | >Using GPA as the primary deciding factor.\n\n... |
491 | I'm referring to tech. FAANG, HFTs, some other... |
492 | I got into FAANG and HFT without a FCH. So yea... |
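The comment-collection loop above is a breadth-first walk over each thread: take a comment off the front of the queue, record its body, then add its replies to the back. The sketch below shows the same idea without any network access. `FakeComment` and `flatten_comments` are hypothetical stand-ins, not part of asyncpraw.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class FakeComment:
    """Hypothetical stand-in for a Reddit comment: a body plus nested replies."""
    body: str
    replies: list = field(default_factory=list)

def flatten_comments(top_level):
    """Collect every comment body, visiting the tree level by level."""
    bodies = []
    queue = deque(top_level)           # start with the top-level comments
    while queue:
        comment = queue.popleft()      # O(1), unlike list.pop(0)
        bodies.append(comment.body)
        queue.extend(comment.replies)  # replies are visited after the current level
    return bodies

thread = [
    FakeComment("A", [FakeComment("A1", [FakeComment("A1a")]), FakeComment("A2")]),
    FakeComment("B"),
]
print(flatten_comments(thread))  # ['A', 'B', 'A1', 'A2', 'A1a']
```

Using a `deque` instead of a plain list is a small design choice: popping from the front of a list shifts every remaining element, which adds up on long threads.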
Data cleaning is an essential part of data preparation, which can make or break the success of a data analysis project.
We identified a few characters in the comments, mostly Markdown markers such as ##, ** and >, along with newlines and "[deleted]" placeholders, that could cause problems for us later.
df['comments'] = df['comments'].map(
    lambda x: x
    .replace('##', '')
    .replace('**', '')
    .replace('\\', '')
    .replace(">", '')
    .replace("...", '')
    .replace("\n", ' ')
    .replace("[deleted]", '')
)
id | comments |
---|---|
... | ... |
488 | Using GPA as the primary deciding factor. Hear... |
489 | Are you referring to tech specifically or acro.. |
490 | Using GPA as the primary deciding factor. Sorry... |
491 | I'm referring to tech. FAANG, HFTs, some other... |
492 | I got into FAANG and HFT without a FCH. So yea... |
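As an alternative to chaining `replace()` calls, the same clean-up can be sketched with a single regular expression. This is an assumption-laden variant rather than the code used for this post: `clean_comment` is a hypothetical helper, and unlike the chain above it also collapses repeated whitespace and trims the ends.

```python
import re

# Roughly the same markers the replace() chain removes:
# '##', '**', backslash, '>', '...', and '[deleted]'.
MARKUP = re.compile(r"##|\*\*|\\|>|\.\.\.|\[deleted\]")

def clean_comment(text):
    """Strip Markdown residue, then collapse newlines and repeated spaces."""
    text = MARKUP.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

sample = ">Using GPA as the primary deciding factor.\n\nSorry, **what**?"
print(clean_comment(sample))
# Using GPA as the primary deciding factor. Sorry, what?
```

It could be applied to the dataframe with `df['comments'].map(clean_comment)`.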
We will be using three different methods for sentiment analysis: VADER (Valence Aware Dictionary and sEntiment Reasoner), the RoBERTa Transformers model, and Transformers pipelines.
VADER is a lexicon- and rule-based approach to sentiment analysis: it relies on a dictionary of words, each rated for how positive or negative it is, to predict the polarity of a text, which can be positive, negative, or neutral.
The rules for scoring a text combine the ratings of the words it contains with their context, such as punctuation, capitalisation, and nearby modifiers.
For example, a negation such as "not" or "never" flips the polarity of the words that follow it, so "not good" is scored as negative.
VADER has been found to be effective at identifying sentiment in short, informal text, even when the sentiment is mixed.
It can be used on social media posts, product reviews, and other user-generated content to quickly get a sense of the sentiment of the text.
However, the downside is that VADER can be fooled by sarcasm and irony, and it does not work well with longer texts.
Additionally, because it is based on a fixed, hand-built lexicon rather than a model trained on your own data, it is not as customisable as some other sentiment analysis tools.
Let's import the NLTK library and the required modules.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
## the analyzer needs the VADER lexicon, which is downloaded once
nltk.download('vader_lexicon')
sentiment_intensity_analyzer = SentimentIntensityAnalyzer()
Next, we are going to calculate the polarity scores of the comments using the polarity_scores method of NLTK's SentimentIntensityAnalyzer class.
The method accepts a string as input and returns a dictionary containing four scores: neg (negative), neu (neutral), pos (positive), and compound.
res = {}
for i, row in df.iterrows():
    comments = row['comments']
    comment_id = row['id']
    res[comment_id] = sentiment_intensity_analyzer.polarity_scores(comments)
id | neg | neu | pos | compound | comments |
---|---|---|---|---|---|
... | ... | ... | ... | ... | ... |
488 | 0.252 | 0.674 | 0.074 | -0.7184 | Using GPA as the primary deciding factor. Hear... |
489 | 0.0 | 1.0 | 0.0 | 0.0 | Are you referring to tech specifically or acro... |
490 | 0.073 | 0.813 | 0.114 | 0.5499 | Using GPA as the primary deciding factor. Sorry... |
491 | 0.0 | 0.882 | 0.118 | 0.5267 | I'm referring to tech. FAANG, HFTs, some other... |
492 | 0.0 | 1.0 | 0.0 | 0.0 | I got into FAANG and HFT without a FCH. So yea... |
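The compound column is a single normalised score between -1 and 1, and it is the usual starting point for assigning a label. The sketch below shows the normalisation VADER applies to the summed word ratings, plus one common way of turning the compound score into a label. The cut-offs of +/-0.05 are the conventional ones suggested by VADER's authors. They are an assumption here, since this post does not state which thresholds were used, and `label_from_compound` is a hypothetical helper.

```python
import math

def normalize(raw_sum, alpha=15):
    """Squash an unbounded sum of word ratings into (-1, 1); alpha=15 is VADER's default."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a label using the conventional +/-0.05 cut-offs (assumed)."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Compound values taken from rows 488, 489 and 490 of the table above.
for compound in (-0.7184, 0.0, 0.5499):
    print(compound, label_from_compound(compound))
```

Widening the threshold pushes more borderline comments into the neutral bucket, which can be useful when the text is noisy.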
sentiments | comments |
---|---|
... | ... |
Neutral | Incorporated in Singapore, headquartered in Singapore, helmed by Singaporean (technically) but in truth a China company to the bone. |
Neutral | Not really surprised looking at their stock price |
Negative | Sucky company btw |
Negative | fuck |
Negative | Basically petty political bullshit |
Positive | Well put. |
Positive | Indeed, friend. |
In recent years, rule-based systems and other conventional approaches to sentiment analysis have given way to more complex methods that employ deep learning techniques.
The RoBERTa Transformers model is an example of such a strategy.
It is a transformer-based model that was pre-trained on a large corpus of text.
The model can be fine-tuned for a variety of tasks, including sentiment analysis.
There are several benefits to employing the RoBERTa model for sentiment analysis.
For one thing, the model is very accurate.
Its authors reported that it outperforms the well-known BERT model on several standard language-understanding benchmarks.
Also, because the model is already pre-trained, fine-tuning it for a task such as sentiment analysis takes far less time and data than training a model from scratch, which can save time and money.
The RoBERTa Transformers model does, however, have several drawbacks.
It takes a lot of computing power, which can make jobs like sentiment analysis very slow without a GPU.
Additionally, the RoBERTa Transformers model is not easily interpretable.
It is difficult to understand why the model gives certain results, which can be a problem when trying to improve it.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
## turns numbers into probabilities which sum to one
from scipy.special import softmax

## text classification model that classifies text to emotions.
MODEL = "j-hartmann/emotion-english-distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def return_polarity_scores(text):
    """
    A custom function to compute polarity scores
    Returns: anger, disgust, fear, joy, neutral, sadness, surprise
    """
    encoded_text = tokenizer(text, return_tensors='pt')
    output = model(**encoded_text)
    ## raw scores (logits) for the first and only text in the batch
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    scores_dict = {
        'anger': scores[0],
        'disgust': scores[1],
        'fear': scores[2],
        'joy': scores[3],
        'neutral': scores[4],
        'sadness': scores[5],
        'surprise': scores[6],
    }
    return scores_dict
res = {}
for i, row in df.iterrows():
    try:
        comments = row['comments']
        comment_id = row['id']
        res[comment_id] = return_polarity_scores(comments)
    except RuntimeError:
        ## comments the model cannot process (for example, very long ones) are skipped
        print(f'Error on row {i}')
id | anger | disgust | fear | joy | neutral | sadness | surprise | comments |
---|---|---|---|---|---|---|---|---|
... | ... | ... | ... | ... | ... | ... | ... | ... |
488 | 0.4298040270805359 | 0.4912334978580475 | 0.004193916451185942 | 0.001422425382770598 | 0.05335167795419693 | 0.0170204546302557 | 0.002974043134599924 | Using GPA as the primary deciding factor. Hear... |
489 | 0.01266899611800909 | 0.00997665710747242 | 0.0038051940500736237 | 0.002049379050731659 | 0.8720080256462097 | 0.003747645765542984 | 0.09574423730373383 | Are you referring to tech specifically or acro... |
490 | 0.029277583584189415 | 0.04661332443356514 | 0.004849941004067659 | 0.004099743440747261 | 0.5986599922180176 | 0.2579329013824463 | 0.058566685765981674 | Using GPA as the primary deciding factor. Sorry... |
491 | 0.0064127808436751366 | 0.007289603352546692 | 0.0016105140093713999 | 0.004326922353357077 | 0.8848540782928467 | 0.00819186121225357 | 0.08731438219547272 | I'm referring to tech. FAANG, HFTs, some other... |
492 | 0.1379590630531311 | 0.15500691533088684 | 0.013789849355816841 | 0.006102893501520157 | 0.3878730833530426 | 0.01567666418850422 | 0.2835916578769684 | I got into FAANG and HFT without a FCH. So yea... |
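Each row of the table holds seven probabilities that sum to one, because the model's raw scores are passed through softmax. A natural way to reduce a row to a single emotion is to pick the label with the highest probability. The sketch below illustrates both steps in plain Python. The logits are made up, and `dominant_emotion` is a hypothetical helper, not necessarily how the labels in this post were derived.

```python
import math

LABELS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    peak = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dominant_emotion(scores):
    """Return the label with the highest probability."""
    return max(scores, key=scores.get)

# Made-up logits, purely for illustration.
probabilities = softmax([2.0, 2.1, -2.5, -3.6, 0.0, -1.1, -2.9])
scores = dict(zip(LABELS, probabilities))
print(dominant_emotion(scores))  # disgust (it has the largest logit, 2.1)
```

Applied to row 488 above, the same rule would pick disgust (about 0.49) over anger (about 0.43), which shows how close the top two emotions can be.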
sentiments | comments |
---|---|
Anger | they get what they deserve. |
Disgust | Shopee is an ass of a company. Comfirm they engange in slaving their employees away with the '996' culture.And yet I still buy from them. |
Fear | The have the "power" to backlist him among the tech firm in China. It's something HR in China often threaten because apparently they know each other. Somehow. Guanxi you know. |
Joy | I love the shoPEE notification!!!! |
Neutral | Anyone know how to get in touch with those who were affected by Shopee's decisions? |
Sadness | Sigh hard times for big tech :/ |
Surprise | Is shopee Chinese dominated? How come they can withdraw the offer ie it should have already been accepted before the guy was to start on 29th right? |
If you're not familiar with HuggingFace's Transformers library, you might be wondering what all the fuss is about.
Transformers is a natural language processing (NLP) library that has become well known for giving easy access to pre-trained models that produce state-of-the-art results in a variety of NLP tasks.
The library takes its name from the transformer architecture, which uses an attention mechanism to relate every word in a sentence (or any sequence of data) to every other word.
This lets transformer networks learn associations between words that are far apart, something that earlier sequence models such as recurrent neural networks tend to struggle with.
Because the entire sequence is taken into account at once, these models can produce highly accurate outcomes on a range of tasks.
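To make the idea of relating distant words concrete, here is a heavily simplified sketch of scaled dot-product self-attention in plain Python. Real transformer layers add learned projection matrices, multiple heads, and positional information, all of which are left out here. The two-dimensional "embeddings" are made up for illustration.

```python
import math

def softmax(values):
    """Convert raw scores into weights that sum to one."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Compare this token with every token in the sequence, near or far.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # Blend all token vectors according to the attention weights.
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up embeddings: the first and last tokens are deliberately similar.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.1]]
_, weights = self_attention(tokens)
print([round(w, 2) for w in weights[0]])
```

In this toy example the first token gives more weight to the last token than to its immediate neighbours, because the weights depend on similarity rather than on distance.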
While HuggingFace transformers provide numerous benefits for sentiment analysis, there are certain drawbacks to consider.
One such drawback is that a transformer model can overfit to its training data.
This means that the model might not generalize well to new data, and so might not accurately predict the tone of new documents.
In addition, transformer models can be slow to train and can require a substantial amount of training data.
Finally, they are difficult to interpret and may not provide clear explanations of the predictions they make.
from transformers import pipeline

## This is a RoBERTa-base model trained on ~124M tweets from January 2018 to December 2021,
## and fine-tuned for sentiment analysis with the TweetEval benchmark.
## https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"
sent_pipeline = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)
res = {}
for i, row in df.iterrows():
    try:
        comments = row['comments']
        comment_id = row['id']
        res[comment_id] = sent_pipeline(comments)
    except RuntimeError:
        ## comments the model cannot process (for example, very long ones) are skipped
        print(f'Error on row {i}')
id | label | score | comments |
---|---|---|---|
... | ... | ... | ... |
486 | Negative | 0.8902063965797424 | still good la hahaaameans shopee really tighte...belts |
487 | Neutral | 0.9093933701515198 | The way it works is that they get an IPA (in-p... |
488 | Negative | 0.6428775787353516 | Using GPA as the primary deciding factor. Hear... |
489 | Neutral | 0.8256815075874329 | Are you referring to tech specifically or acro... |
490 | Neutral | 0.8007738590240479 | Using GPA as the primary deciding factor.Sorry... |
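For a single string, the pipeline returns a list containing one dictionary with a label and a score, so the results need flattening before they fit a table like the one above. The sketch below shows one way to do that. `flatten_results` and its `min_score` filter are hypothetical, and the sample data is a hand-written stand-in that mirrors rows 488 and 489 rather than real pipeline output.

```python
def flatten_results(results, min_score=0.0):
    """Turn {id: [{'label': ..., 'score': ...}]} into a list of flat rows."""
    rows = []
    for comment_id, predictions in results.items():
        top = predictions[0]  # one prediction per input string
        if top["score"] >= min_score:
            rows.append({"id": comment_id, "label": top["label"], "score": top["score"]})
    return rows

# Hand-written stand-in for pipeline output, mirroring rows 488 and 489 above.
fake_results = {
    488: [{"label": "Negative", "score": 0.6429}],
    489: [{"label": "Neutral", "score": 0.8257}],
}
print(flatten_results(fake_results))
print(flatten_results(fake_results, min_score=0.8))  # keeps only confident predictions
```

The resulting rows can be passed to `pd.DataFrame(...)` and merged with the comments on the id column.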
sentiments | comments |
---|---|
... | ... |
Negative | Ultimate douchebags, they think they're on the same league as FAAANG and can play this kinda stun I would stay the fk away unless you have an addiction to pain and suffering, cus even if u were to work for them, it's 996 |
Negative | Shopee is a company which I will never consider joining, regardless of how "good" the compensation can be if there is any to be left with.They don't deserve our local tech talents. Not now, not ever. |
Negative | just how disorganized is this company? lol |
Neutral | I work in FAANG, everyone in the software industry in Singapore knows that the first company to avoid is Shopee. |
Neutral | 70% of their workforce is at Guangzhou anyway |
Positive | I interviewed with Ian Ho MD of Shopee Singapore. He is a stuck up cunt. |
Positive | Shoppeee doesn't care? Everyone knows is Shiat, but cheap? The same applies to China, for example. 🤷🏾PR _really_ doesn't matter when you're cheap as fuck |
Positive | show pee pee pee pee peeeeeeeeeeee |
Positive | Back in 2014, when tech was not so exciting, their remuneration for Tech roles was some of the best as a graduate hireRight now, I understand it is like 10% over last drawn. |
A word cloud is a visual representation of a text in which each word is sized according to how often it appears in the source.
Word clouds can be a fun and easy way to visualize text data.
Before we generate our word cloud, we are going to lower-case the comments, join them into a single string, and remove all stop words using the list provided by the NLTK library.
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud

nltk.download('stopwords')
en_stops = set(stopwords.words('english'))

## pipeline_df is assumed to hold the pipeline results alongside the original comments
## Lower-case every comment and join them into a single string
all_comments_text = ",".join(str(item).lower() for item in pipeline_df['comments'])

wordcloud_all = WordCloud(
    stopwords=en_stops,
    background_color='black',
    max_words=200
).generate(all_comments_text)
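Under the hood, a word cloud is driven by word frequencies with the stop words filtered out. The sketch below computes such frequencies with the standard library only. The tiny stop-word set and the sample comments are made up for illustration, and `word_frequencies` is a hypothetical helper rather than part of the wordcloud package.

```python
import re
from collections import Counter

# Tiny hand-picked stop-word set; NLTK's English list is much longer.
TOY_STOPS = {"the", "is", "a", "to", "and", "of", "at"}

def word_frequencies(comments, stop_words=TOY_STOPS):
    """Count how often each non-stop word appears across all comments."""
    counts = Counter()
    for comment in comments:
        words = re.findall(r"[a-z']+", comment.lower())
        counts.update(word for word in words if word not in stop_words)
    return counts

sample = ["Shopee is a company", "The company culture at Shopee", "Layoffs and more layoffs"]
print(word_frequencies(sample).most_common(3))
```

A dictionary like this can also be handed to WordCloud's `generate_from_frequencies` method, which is handy when you want to inspect or adjust the counts before drawing.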
One of the main problems with sentiment analysis is that it can be hard to tell when someone is being sarcastic or ironic.
This is because the algorithms that are used to evaluate text data are often not sophisticated enough to pick up on these kinds of nuances.
As a result, sentiment analysis can sometimes produce inaccurate results.
Another limitation of sentiment analysis is its handling of slang and colloquialisms.
These are common in social media posts, which makes such posts harder to analyse.
For instance, while sentiment analysis is possible on Singlish, it is limited because Singlish syntax differs from that of standard English and most tools are built around the latter.
"No Lah!" and "No Lah." have two different meanings.
Additionally, Singlish contains several loanwords from other languages, which can complicate sentiment analysis further.
Despite these limitations, sentiment analysis is still a valuable tool for making business decisions.