Importing and Trimming Unnecessary Data
Even before we begin cleaning the data, it is crucial to understand the problem statement, because what we clean always depends on it. This article is about cleaning Twitter data so that we can classify whether a tweet is positive or negative. The imported data must be cleaned before it can be used for sentiment classification.
For example, we need to handle URLs, mentions, hashtags, slang words, extra white space, spelling mistakes, and so on.
We will use Twitter data from Kaggle to illustrate every topic covered throughout the article.
Importing the Library

First, we import all the necessary libraries: Pandas (for importing and exploring the data), NLTK (for text processing such as tokenization, stop word removal, and stemming/lemmatization), and scikit-learn (for building the model).
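Based on the libraries mentioned above, a minimal import block for this walkthrough might look like the following sketch (the exact set of imports is an assumption):

```python
import re                                   # regular expressions for cleaning emails, URLs, etc.
import pandas as pd                         # loading and exploring the data
import nltk                                 # tokenization, stop words, stemming/lemmatization
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer   # Bag of Words
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score

# NLTK data files used later; downloads are skipped if already present
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
```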
Reading the Dataset


Here we read the Twitter data from Kaggle, which contains 1,600,000 samples. Using the pandas head command, we can inspect the top five rows with all the columns.
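A sketch of this step is shown below; the file name and column names are assumptions based on the standard Sentiment140 download, not necessarily the exact ones used here:

```python
# The Sentiment140 CSV ships without a header row, so we supply column names ourselves
cols = ["sentiment", "id", "date", "flag", "user", "tweet"]
df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                 encoding="latin-1", names=cols)

print(df.shape)   # (1600000, 6)
df.head()         # inspect the top five rows with all the columns
```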
Cleaning the Data for the Next Step

Next, we obtain basic information about the data, such as the number of rows and columns and the data types.
We need only two columns for this problem statement: the sentiment and tweet columns. We will drop the remaining columns.
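A minimal sketch of these two steps, assuming the column names introduced earlier:

```python
# Number of rows and columns, column dtypes, and memory usage
df.info()

# Keep only the two columns the problem statement needs
df = df[["sentiment", "tweet"]]
df.head()
```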

Looking at the resulting data, we can see that the tweet column is the input column, based on which we decide whether a tweet is positive or negative.

We have two classes in the sentiment column, negative and positive, denoted by 0 and 4. Doesn't that seem a little odd? Using the pandas replace method, we will now map 0 and 4 to 0 and 1:
- 0 -> Negative
- 1 -> Positive
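A sketch of the label mapping with pandas' replace method:

```python
# Map the original labels (0 = negative, 4 = positive) to 0 and 1
df["sentiment"] = df["sentiment"].replace(4, 1)
df["sentiment"].value_counts()   # should now show only 0s and 1s
```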
Since we are learning NLP concepts by building a project, using all 1,600,000 rows would consume considerable time, so for learning purposes we will reduce the dataset to 10,000 rows.

To build this subset, we do three things: separate the positive and negative tweets, store each subset in its own variable, and sample the rows down from each. With 5,000 positive and 5,000 negative tweets, the new dataset we are creating is balanced.
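One way to build the balanced subset (the variable names and random_state are illustrative):

```python
# Separate the two classes, sample 5,000 tweets from each, and combine them
positive = df[df["sentiment"] == 1]
negative = df[df["sentiment"] == 0]

df = pd.concat([positive.sample(5000, random_state=42),
                negative.sample(5000, random_state=42)]).reset_index(drop=True)

df["sentiment"].value_counts()   # 5,000 of each class: the dataset is balanced
```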
So far, we have imported the data, dropped the columns we don't need, and selected a balanced sample of 10,000 rows for learning purposes. Now we can start cleaning the actual tweets for the model-building process.
It is time to clean up your text
From this point on, we will actually clean the tweet column.

The plan is to first clean the tweets and then normalize the text so that we can build the model on top of it.
Removing Email

To remove emails, we use Python's 're' module for pattern matching. The 're' module provides regular expressions, which are very useful for matching patterns such as emails, URLs, and phone numbers.
Still not clear with the definition?
For a better understanding, let me provide you with an example.
Ms. Sonakshi Gupta
22/1, lucky Road
Kiran Park
Pune, Maharashtra
411002
India
In the example above, the address is in Indian format. Let's say I have 10 or 500 addresses in the same format; from those addresses, I need to extract the first name, last name, PIN code, and country, and create new columns named first name, last name, PIN code, and country. This is possible using the 're' module, that is, regular expressions.
After applying a regular expression, we can see output like the example below.
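Here is an illustrative sketch; the pattern is written only for the address layout shown above and would need adjusting for real data:

```python
import re

address = """Ms. Sonakshi Gupta
22/1, lucky Road
Kiran Park
Pune, Maharashtra
411002
India"""

# Name on the first line, 6-digit PIN code and country on the last two lines
match = re.search(r"^\w+\.\s+(\w+)\s+(\w+).*?(\d{6})\s+(\w+)\s*$",
                  address, re.DOTALL | re.MULTILINE)
if match:
    first_name, last_name, pin_code, country = match.groups()
    print(first_name, last_name, pin_code, country)   # Sonakshi Gupta 411002 India
```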

We can extract almost anything from text in this way; if we wanted to find emails across lakhs of textual records, regular expressions make it much easier.
To gain a better understanding, here are two websites where you can learn more about regular expressions:
- RegexLearn
- Regex 101
Coming back to our tweets, we apply a regular expression to remove email addresses, since they do not contribute to the meaning of the tweet.
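A sketch of the email-removal step; the regex is a simple approximation rather than a full email grammar, and `df` is the dataframe built in the earlier steps:

```python
import re

def remove_emails(text):
    # Drop anything that looks like word@word, e.g. "mail me at someone@example.com"
    return re.sub(r"\S+@\S+", "", text)

df["tweet"] = df["tweet"].apply(remove_emails)
```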
Removing URL

Occasionally, people share multiple URLs in a tweet, including
- Movie URLs
- Company websites
- YouTube video URLs
The list goes on and on. Here, we remove all of the URLs from the tweets.
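A sketch of the URL-removal step, continuing with the same dataframe:

```python
import re

def remove_urls(text):
    # Remove http/https links as well as bare www. links
    return re.sub(r"https?://\S+|www\.\S+", "", text)

df["tweet"] = df["tweet"].apply(remove_urls)
```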
Removing Numbers

Next, we remove the numbers from the tweets, because keeping them does not contribute to the problem statement. You might object that numbers and URLs sometimes matter, and you would be right, but for this problem we do not need them, so we remove them. Numbers, URLs, and emails may well be useful for other problem statements, just not for this one.
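A sketch of this step:

```python
import re

def remove_numbers(text):
    # Strip digit sequences; they carry no sentiment signal for this problem
    return re.sub(r"\d+", "", text)

df["tweet"] = df["tweet"].apply(remove_numbers)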
Removing Hashtags or Special characters

Whether a tweet's message is positive or negative does not depend on hashtags or other special characters, so we strip them out as well.
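One simple way to do this is to keep only letters and whitespace:

```python
import re

def remove_special_characters(text):
    # Keeping only letters and spaces also drops '#', '@' and other punctuation
    return re.sub(r"[^a-zA-Z\s]", "", text)

df["tweet"] = df["tweet"].apply(remove_special_characters)
```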
To conclude the text cleaning, note that I have applied each of the above steps separately so that you can understand them better.
Normalizing your document’s text

Once we have completed the required cleaning, we begin text normalization. All of the steps mentioned below can be combined into a single function, and each step of the process is discussed in detail.

Let us take a closer look at the loop and what we do within it, one step at a time. For each tweet in the tweet column, we apply the following operations:
- Lowercasing the tweet. What is the purpose of doing this? Lowercasing is part of pre-processing: it ensures that, for example, "Good" and "good" are treated as the same word, so there is no confusion during the modelling process.
Tokenization
Tokenization involves breaking down a text document into smaller units known as tokens.
What are tokens?
Tokens are usually words, although punctuation marks can also become tokens.
Input: “This is an example. It breaks the text into individual words or tokens.”
Output: [‘This’, ‘is’, ‘an’, ‘example’, ‘.’, ‘It’, ‘breaks’, ‘the’, ‘text’, ‘into’, ‘individual’, ‘words’, ‘or’, ‘tokens’, ‘.’]
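With NLTK, the same example can be tokenized like this:

```python
from nltk.tokenize import word_tokenize

text = "This is an example. It breaks the text into individual words or tokens."
tokens = word_tokenize(text)
print(tokens)
# ['This', 'is', 'an', 'example', '.', 'It', 'breaks', 'the', 'text',
#  'into', 'individual', 'words', 'or', 'tokens', '.']
```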
Stop Words
Having tokenized the text, let us take a moment to consider stop words and understand what they are. Stop words are words that are commonly used in a language but contribute little to its meaning. The following are examples of stop words in English: "a", "an", "the", "and", "but", "in", "at", "for", "with", etc.
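A quick sketch of filtering stop words with NLTK's built-in English list:

```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["this", "is", "an", "example", "with", "a", "few", "stop", "words"]
filtered = [word for word in tokens if word not in stop_words]
print(filtered)   # ['example', 'stop', 'words']
```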
Stemming and Lemmatization
Stemming and Lemmatization are techniques used in Natural Language Processing (NLP) to reduce a word to its root or base form.

Stemming: in stemming, suffixes are removed from a word by applying a set of rules ("ing", "ly", "es", "s", etc.)
For example: saw -> saw (a stemmer only strips suffixes, so the word is left unchanged)

Lemmatization: lemmatization obtains a word's root form (its lemma) in a more principled way. In addition to vocabulary lookup (dictionary knowledge), it incorporates morphological analysis (word structure and grammatical relations).
For example: saw -> see (when "saw" is the past tense of the verb "see")
Which one to choose? It always depends on the nature of the problem statement that we are dealing with. It is necessary to test both in order to determine which is most suitable for our problem statement.
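Putting the normalization steps together, one possible version of the loop is sketched below; it uses lemmatization, and a stemmer such as NLTK's PorterStemmer could be dropped in instead:

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def normalize(text):
    tokens = word_tokenize(text.lower())                                  # lowercase + tokenize
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]   # drop stop words and leftovers
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)              # lemmatize and rejoin

df["clean_tweet"] = df["tweet"].apply(normalize)
df[["tweet", "clean_tweet"]].head()
```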

After cleaning and normalizing, the tweets are reduced to lowercase, lemmatized words. Let's dive in and continue building on them.
There are various vectorization techniques that can be used

Why vectorization techniques? Machine learning models cannot understand text data directly, so vectorization is used to convert raw text into numerical vectors that algorithms can process. In short, vectorization techniques represent text data as a set of numerical features that can be used to train a machine learning model.

Bag of Words (BoW): one way to represent a document or tweet is to count the frequency of each word and present the counts as a vector. Keep in mind that the order of the words does not affect this technique.

The idea is easiest to see with a small worked example.
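Here is a sketch using scikit-learn's CountVectorizer on two made-up sentences; the learned vocabulary and the per-document counts are printed so you can see the vectors:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was good", "the movie was not good at all"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # vocabulary learned from the documents
print(bow.toarray())                        # word counts per document, order ignored
```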
Advantages of BoW:
The Bag of Words model is a popular and efficient approach in NLP. It counts the frequency of each word in a document, making it fast and suitable for large datasets. Also, it doesn’t need any specific linguistic knowledge and can be applied to any language.
Disadvantages of BoW
The Bag-of-Words (BoW) model doesn't consider the sequence of words in a document, treating each word as a distinct feature. Because of that, important contextual meaning, such as idiomatic expressions and figures of speech, is lost, which can hurt understanding.
When working with a vast dataset, the vocabulary used in the BoW model can become extensive. Consequently, the resulting matrix is often sparse, requiring more memory and processing power to handle.
How to Divide the Data into X and Y
Now we will split the data into X and y and train a Naive Bayes classifier using the Python code below.
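A sketch of the split-and-train step, assuming the cleaned column produced during normalization:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X = df["clean_tweet"]     # input text
y = df["sentiment"]       # 0 = negative, 1 = positive

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Bag of Words: fit the vocabulary on the training set only
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

model = MultinomialNB()
model.fit(X_train_bow, y_train)
```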

After building the model, we use a confusion matrix to check how accurately it classifies whether a tweet is positive or not.
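Evaluation might then look like this, continuing the sketch above:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = model.predict(X_test_bow)

print(confusion_matrix(y_test, y_pred))   # rows: actual class, columns: predicted class
print(accuracy_score(y_test, y_pred))     # fraction of tweets classified correctly
```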


The accuracy achieved is a reasonable baseline, and we can enhance it further with other vectorization techniques such as TF-IDF and Word2Vec. The purpose of this article was to build a fundamental understanding of NLP while developing a project with an ML model.
Conclusion
In this blog post, we delved into the topic of Twitter sentiment analysis and explored the various techniques and methodologies involved in building a sentiment analysis model. Our first step was to understand the basics of Natural Language Processing and its relevance in sentiment analysis.
We then went on to cover two significant techniques in text preprocessing – text cleaning and text normalization. These techniques help to transform raw text data into a structured format that is easier to analyze.
Lastly, we learned about the bag of words approach as part of the vectorization techniques. This technique involves representing text data as a vector of the frequency of words used in the text. We used Python to demonstrate how this technique could be implemented and utilized in sentiment analysis.