Introduction to NLP

Last Updated: 22nd May, 2023
Sumanta Muduli

Data Scientist at Flutura Decision Sciences & Analytics at almaBetter

What are some of the applications of NLP?

  1. Writing assistants: Grammarly, Microsoft Word, Google Docs
  2. Search engines: DuckDuckGo, Google
  3. Voice assistants: Alexa, Siri
  4. News feeds: Facebook, Google News
  5. Translation systems: Google Translate

Why text preprocessing?

Computers are great at working with structured data such as spreadsheets and database tables, but humans usually communicate in words, not tables, and computers cannot interpret raw text directly. NLP addresses this with techniques that convert language into numbers or other mathematically interpretable objects, which can then be fed to ML algorithms as the task requires.

Machine learning needs data in numeric form. Before encoding, the text has to be cleaned; this preparation step is called text preprocessing, and it is the first step in solving most NLP problems. Libraries such as spaCy and NLTK make preprocessing easier.

Steps involved in preprocessing:

Cleaning

1. Removing URLs

Python's built-in re library can be used to remove URLs.
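
The original code for this step is not available, so here is a minimal sketch of URL removal with re. The pattern and the function name remove_urls are assumptions made for illustration: anything that starts with http://, https:// or www. and runs up to the next whitespace is treated as a URL.

```python
import re

# Assumed pattern: a URL starts with http://, https:// or www.
# and continues until the next whitespace character.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def remove_urls(text: str) -> str:
    """Delete URLs, then collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()


print(remove_urls("Read more at https://example.com/nlp or www.example.org today"))
# Read more at or today
```

Because the pattern stops only at whitespace, punctuation attached to the end of a URL is removed together with it.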

2. Removing punctuation and numbers

Punctuation is the set of symbols !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ (the contents of Python's string.punctuation).
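
One possible way to drop both punctuation and digits is a translation table built from string.punctuation and string.digits. The function name remove_punct_and_numbers is an assumption for illustration, and the sketch only covers ASCII punctuation, so typographic quotes such as ’ would survive.

```python
import string

# Translation table that deletes every ASCII punctuation character and digit.
_DROP_TABLE = str.maketrans("", "", string.punctuation + string.digits)


def remove_punct_and_numbers(text: str) -> str:
    """Strip punctuation and digits, then normalize the remaining whitespace."""
    return " ".join(text.translate(_DROP_TABLE).split())


print(remove_punct_and_numbers("Hello, world! It's 2023..."))
# Hello world Its
```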


3. Converting all text to lower case
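
Lower-casing makes "NLP", "Nlp" and "nlp" count as the same word. A minimal sketch using Python's built-in str.lower (the wrapper name to_lowercase is an assumption for illustration):

```python
def to_lowercase(text: str) -> str:
    """Return the text with every cased character converted to lower case."""
    return text.lower()


print(to_lowercase("NLP Is Fun"))
# nlp is fun
```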


4. Removing stopwords
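
Stopwords are very common words such as "the", "is" and "on" that usually carry little meaning on their own. The sketch below filters them out using a tiny hand-picked set, which is an assumption made so the example is self-contained; in practice the list would normally come from a library such as NLTK or spaCy. It expects text that has already been lower-cased.

```python
# Tiny hand-picked stopword set, for illustration only (an assumption);
# real stopword lists contain well over a hundred entries.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "in", "on", "of", "and", "to", "it"}


def remove_stopwords(text: str) -> str:
    """Keep only the whitespace-separated words that are not stopwords."""
    kept = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(kept)


print(remove_stopwords("the cat is on the mat"))
# cat mat
```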


5. Tokenization

It is a method of splitting a string into smaller units called tokens. A token can be a word, a punctuation mark, a mathematical symbol, a number, etc.
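
A rough sketch of a tokenizer built on a regular expression is shown here. The pattern is an assumption chosen for illustration: each run of word characters becomes one token, and every other non-space character becomes a token of its own. Library tokenizers, such as those in NLTK or spaCy, handle contractions and special cases far more carefully.

```python
import re
from typing import List

# Assumed pattern: a run of word characters, or a single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split text into word, number, and symbol tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Price: 42 dollars!"))
# ['Price', ':', '42', 'dollars', '!']
```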


6. Stemming and Lemmatization

  • Stemming algorithms work by cutting off the end or the beginning of a word, using a list of common prefixes and suffixes found in inflected words. This indiscriminate cutting works in some cases but not all, which is the main limitation of the approach.
  • Lemmatization, on the other hand, relies on morphological analysis of the words. It needs detailed dictionaries that the algorithm can look through to link each form back to its lemma. See the figure for a clearer picture.
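
The difference between the two approaches can be sketched with toy versions of each. Everything here is a simplifying assumption for illustration: the suffix list, the four-entry lemma dictionary, and the function names naive_stem and naive_lemmatize. The stemmer only strips suffixes, and real tools (for example NLTK's PorterStemmer and WordNetLemmatizer) are far more thorough.

```python
# Toy suffix list and lemma dictionary: illustrative assumptions only.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"studies": "study", "studying": "study", "better": "good", "was": "be"}


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def naive_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)


for word in ["studies", "studying", "better"]:
    print(word, naive_stem(word), naive_lemmatize(word))
# studies studi study
# studying study study
# better better good
```

The stem "studi" is not a real word, while the dictionary lookup returns "study"; likewise only the lemmatizer can map "better" to "good".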

[Figure: stemming vs. lemmatization]

7. Removing short words of length ≤ 2

Even after the steps above, some noise remains in the corpus, so I also remove words that are very short.
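
A minimal sketch of this filter, assuming the text has already been tokenized into a list of words (the function name and the min_length parameter are assumptions for illustration):

```python
from typing import List


def remove_short_words(tokens: List[str], min_length: int = 3) -> List[str]:
    """Drop tokens shorter than min_length, i.e. of length <= 2 by default."""
    return [token for token in tokens if len(token) >= min_length]


print(remove_short_words(["we", "go", "to", "the", "market"]))
# ['the', 'market']
```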


8. Converting the list back into a string
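
Vectorizers generally expect each document as a single string, so the cleaned tokens are joined back together. A minimal sketch with str.join (the function name tokens_to_text is an assumption for illustration):

```python
from typing import List


def tokens_to_text(tokens: List[str]) -> str:
    """Join a list of tokens into one space-separated string."""
    return " ".join(tokens)


print(tokens_to_text(["natural", "language", "processing"]))
# natural language processing
```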


Now we are all set to vectorize our text.

Vectorizing

1. CountVectorizer: It converts a collection of text documents to a matrix of token counts, i.e. the number of occurrences of each token in each document. This implementation produces a sparse representation of the counts.
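
To make the idea concrete, here is a hand-rolled sketch of a token-count matrix in plain Python. It is not scikit-learn's CountVectorizer: the function name count_vectorize is hypothetical, tokens are split on whitespace only, and the result is a dense list of lists rather than a sparse matrix.

```python
from collections import Counter


def count_vectorize(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    # Sorted vocabulary of every distinct token in the corpus.
    vocabulary = sorted({token for doc in tokenized for token in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


vocab, counts = count_vectorize(["the cat sat", "the cat saw the dog"])
print(vocab)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(counts)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is one document and each column one vocabulary term, which is the same layout CountVectorizer produces.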


2. TF-IDF: TfidfTransformer transforms a count matrix into a normalized tf (term-frequency) or tf-idf (term-frequency times inverse document-frequency) representation. The formula used to compute the tf-idf for a term t of a document d in a document set is:

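tf-idf(t, d) = tf(t, d) × idf(t)

idf(t) = log[(1 + n) / (1 + df(t))] + 1

where n is the total number of documents in the document set and df(t) is the number of documents that contain the term t. This is the smoothed form that scikit-learn uses by default (smooth_idf=True); with smoothing turned off, idf(t) = log[n / df(t)] + 1.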
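
As a rough sketch of the computation in plain Python, the function below assumes the smoothed variant idf(t) = ln((1 + n) / (1 + df(t))) + 1, raw counts as tf, and L2 normalization of each document's weights. The function name tfidf and the dictionary-based output are assumptions for illustration, not the scikit-learn API.

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """Return (idf, rows): idf per term, and L2-normalized tf-idf weights per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + freq)) + 1 for term, freq in df.items()}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = {term: count * idf[term] for term, count in tf.items()}
        # Guard against division by zero for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return idf, rows


idf, rows = tfidf([["the", "cat"], ["the", "dog"]])
print(idf["the"])                          # 1.0
print(rows[0]["cat"] > rows[0]["the"])     # True
```

"the" occurs in every document, so it gets the lowest possible idf, while "cat" occurs in only one and is weighted more heavily.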


Note:

CountVectorizer only counts the number of times a word appears in a document, which biases the representation in favour of the most frequent words. Rare words, which could have helped us process the data more effectively, end up being ignored.

To overcome this, we use TfidfVectorizer.

TfidfVectorizer takes into account a word's weight across the whole document set, which lets us penalize the most frequent words: it weights the word counts by a measure of how often they appear in the documents.

That’s all folks. Have a nice day!
