📊 Predictive Analytics in Business, Unit 6 – Text Mining & Natural Language Processing
Text mining and natural language processing (NLP) are powerful tools for extracting insights from unstructured text data. These techniques enable businesses to analyze large volumes of text, such as social media posts, customer reviews, and emails, to gain actionable insights.
Applications span various domains including marketing, customer service, healthcare, and finance. Key concepts include tokenization, stop word removal, stemming, and named entity recognition. Advanced techniques like word embeddings and transformer models further enhance text analysis capabilities.
Tokenization: breaking down text into smaller units (words, phrases, or sentences) for analysis
Stop word removal: eliminating common words (the, and, is) that do not contribute to the meaning of the text
Stemming: reducing words to their base or root form by stripping suffixes (running, runs -> run) to normalize the text
Lemmatization: converting words to their dictionary form (better -> good, was -> be), taking context and part of speech into account (see the sketch after this list)
Part-of-speech (POS) tagging: identifying the grammatical role of each word in a sentence (noun, verb, adjective)
Named Entity Recognition (NER): identifying and classifying named entities (person, organization, location) in the text
Term frequency-inverse document frequency (TF-IDF): a numerical statistic that reflects the importance of a word in a document within a collection of documents
N-grams: contiguous sequences of n items (words or characters) from a given text (unigrams, bigrams, trigrams)
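To make these preprocessing concepts concrete, here is a minimal Python sketch using NLTK (one library choice among several; the sample sentence and the listed download resources are assumptions made only for illustration). It tokenizes a sentence, removes stop words, and compares stemming with lemmatization and POS tagging.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the NLTK resources this sketch relies on:
# nltk.download("punkt"); nltk.download("stopwords")
# nltk.download("wordnet"); nltk.download("averaged_perceptron_tagger")

text = "The runners were running quickly through New York City."

# Tokenization: split the sentence into word tokens
tokens = nltk.word_tokenize(text)

# Stop word removal: drop common words such as "the" and "were"
stop_words = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

# Stemming: crude suffix stripping (e.g. "running" -> "run")
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in content])

# Lemmatization: dictionary forms, here treating each token as a verb
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t.lower(), pos="v") for t in content])

# Part-of-speech tagging: grammatical role of each original token
print(nltk.pos_tag(tokens))
```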
NLP Techniques and Tools
Bag-of-words (BoW): representing text as an unordered bag (multiset) of its words, disregarding grammar and word order
Word embeddings: mapping words to dense vector representations that capture semantic relationships (Word2Vec, GloVe)
Word2Vec: a neural network-based approach that learns word embeddings by predicting context words
GloVe: an unsupervised learning algorithm that generates word embeddings based on global word co-occurrence statistics
Topic modeling: discovering abstract topics in a collection of documents (Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF)); see the LDA sketch after this list
LDA: a generative probabilistic model that assumes each document is a mixture of topics, and each topic is a distribution over words
NMF: a matrix factorization technique that decomposes a document-term matrix into two non-negative matrices representing topics and their associated words
Sequence labeling: assigning a label to each element in a sequence of text (Hidden Markov Models (HMM), Conditional Random Fields (CRF))
Recurrent Neural Networks (RNNs): a class of neural networks designed to handle sequential data (Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU))
Transformer models: a type of neural network architecture that relies on self-attention mechanisms to process input sequences (BERT, GPT)
BERT: a pre-trained transformer model that can be fine-tuned for various NLP tasks (sentiment analysis, question answering)
GPT: a generative pre-trained transformer model that can be used for language generation and other NLP tasks
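As an illustration of the topic modeling entry above, here is a minimal scikit-learn sketch (the four-document corpus and the two-topic setting are assumptions chosen only to keep the example small) that builds bag-of-words counts, fits an LDA model, and prints the top words per topic.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus; real topic models need far more documents
docs = [
    "the customer loved the fast delivery and friendly service",
    "delivery was late and the support service never replied",
    "interest rates and loan payments rose again this quarter",
    "the bank raised loan rates despite stable interest forecasts",
]

# Bag-of-words counts feed the LDA model
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a 2-topic LDA model and inspect the top words per topic
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {', '.join(top)}")
```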
Data Preprocessing for Text Analysis
Text cleaning: removing irrelevant or noisy elements from the text (HTML tags, special characters, URLs)
Case normalization: converting all text to a consistent case (lowercase or uppercase) to ensure uniformity
Tokenization: splitting the text into individual words, phrases, or sentences
Stop word removal: filtering out common words that do not contribute to the meaning of the text
Stemming and lemmatization: reducing words to their base or dictionary form to normalize the text
Handling contractions: expanding contractions (don't -> do not) to standardize the text
Dealing with punctuation: removing or retaining punctuation depending on the specific task and requirements
Handling numbers and special characters: deciding whether to keep, remove, or normalize numbers and special characters based on their relevance to the analysis
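A minimal cleaning function combining several of the steps above (the small contraction table, the example string, and the choice to drop numbers are illustrative assumptions; the right choices depend on the task at hand):

```python
import re
import string

# Minimal lookup table for contraction expansion; a real pipeline would use a fuller list
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is", "won't": "will not"}

def clean_text(raw: str) -> str:
    text = raw.lower()                                   # case normalization
    text = re.sub(r"<[^>]+>", " ", text)                 # strip HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # strip URLs
    for contraction, expansion in CONTRACTIONS.items():  # expand contractions
        text = text.replace(contraction, expansion)
    text = re.sub(r"\d+", " ", text)                     # drop numbers (task dependent)
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    return re.sub(r"\s+", " ", text).strip()             # collapse whitespace

print(clean_text("<p>Don't miss our SALE!!! Visit https://example.com before 31/12.</p>"))
# -> "do not miss our sale visit before"
```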
Feature Extraction and Representation
Bag-of-words (BoW): representing text as an unordered bag of its words, disregarding grammar and word order; a vectorization sketch follows this list
Creates a sparse matrix where each row represents a document, and each column represents a word in the vocabulary
The values in the matrix can be raw counts, binary indicators (presence or absence of a word), or weighted scores (TF-IDF)
N-grams: considering contiguous sequences of n items (words or characters) from a given text
Unigrams: individual words
Bigrams: pairs of adjacent words
Trigrams: triplets of adjacent words
TF-IDF: a numerical statistic that reflects the importance of a word in a document within a collection of documents
Term frequency (TF): the frequency of a word in a document
Inverse document frequency (IDF): a measure of how rare a word is across all documents
TF-IDF weight: the product of TF and IDF, indicating the importance of a word in a document and its rarity in the corpus
Word embeddings: mapping words to dense vector representations that capture semantic relationships
Word2Vec: learns word embeddings by predicting context words given a target word (skip-gram) or predicting a target word given context words (continuous bag-of-words)
GloVe: learns word embeddings by factorizing a global word co-occurrence matrix
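The sketch below contrasts bag-of-words counts (including bigrams via ngram_range) with TF-IDF weights using scikit-learn, as referenced earlier in this list; the two toy review sentences are assumptions used only to keep the matrices readable.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "great product and great price",
    "terrible product, the price was not great",
]

# Bag-of-words: raw counts, here including bigrams via ngram_range
bow = CountVectorizer(ngram_range=(1, 2))
X_counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())   # unigram and bigram vocabulary
print(X_counts.toarray())            # sparse document-term matrix, densified for display

# TF-IDF: counts reweighted so terms common to every document score lower
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)
print(dict(zip(tfidf.get_feature_names_out(), X_tfidf.toarray()[0].round(2))))
```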
Text Classification and Clustering
Text classification: assigning predefined categories to text documents based on their content
Supervised learning approach: requires labeled training data
Common algorithms: Naive Bayes, Support Vector Machines (SVM), Logistic Regression, Random Forests
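A minimal supervised text classification sketch with scikit-learn, chaining TF-IDF features into a Naive Bayes classifier (the toy training texts and labels are assumptions; any of the other algorithms listed above could be swapped in for MultinomialNB):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled training data; a real classifier needs many more labeled documents
train_texts = [
    "loved the product, works perfectly",
    "excellent service and fast shipping",
    "terrible quality, broke after one day",
    "awful support, never buying again",
]
train_labels = ["positive", "positive", "negative", "negative"]

# TF-IDF features feeding a Naive Bayes classifier, chained in one pipeline
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
model.fit(train_texts, train_labels)

print(model.predict(["fast shipping and excellent quality"]))
```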