🤟🏼Natural Language Processing Unit 11 – Information Retrieval & Question Answering

Information retrieval and question answering are crucial components of natural language processing. These fields focus on finding relevant information from large collections of data and providing direct answers to user queries. IR techniques like indexing, query processing, and relevance ranking enable efficient searching. QA systems analyze questions, retrieve relevant passages, and extract or generate answers. Advanced models and evaluation metrics continually improve these systems' performance.

Study Guides for Unit 11

11.1

Text indexing and retrieval models

5 min read

11.2

Query understanding and expansion

5 min read

11.3

Passage retrieval and ranking

5 min read

11.4

Question answering systems

5 min read

Key Concepts

Information retrieval (IR) focuses on finding relevant information from large collections of unstructured data (documents, web pages, images, videos)
Query processing involves understanding and transforming user queries into a format suitable for searching
Indexing organizes and structures data for efficient retrieval (inverted index, forward index)
- Inverted index maps terms to the documents containing them enabling fast keyword-based search
- Forward index stores the complete text of each document allowing for more advanced search capabilities
Relevance ranking determines the order of search results based on their relevance to the query (TF-IDF, BM25)
Evaluation metrics assess the effectiveness of IR systems (precision, recall, F1-score, MAP, NDCG)
Question answering (QA) aims to provide direct answers to user questions by understanding the intent and context
Advanced techniques improve IR and QA performance (query expansion, semantic search, neural ranking models)

Information Retrieval Basics

Information need represents the user's underlying goal or purpose for searching
Document collection refers to the set of documents or data sources being searched (web pages, books, articles)
Retrieval models define how documents are matched and ranked against queries (Boolean, vector space, probabilistic)
- Boolean model uses logical operators (AND, OR, NOT) to match documents based on the presence or absence of query terms
- Vector space model represents documents and queries as vectors in a high-dimensional space and measures their similarity (cosine similarity)
Relevance feedback incorporates user feedback to refine search results and improve query understanding
Stemming reduces words to their base or root form (running, runs, ran → run) to improve matching and recall
Stop words are common words (the, and, of) that are often removed during indexing and searching to improve efficiency
Tokenization breaks text into individual words or tokens for further processing and analysis

Query Processing and Expansion

Query understanding involves analyzing the user's query to determine the intent, context, and key information needs
Query normalization standardizes the query by converting to lowercase, removing punctuation, and handling special characters
Query expansion techniques add related terms to the original query to improve recall and coverage
- Synonym expansion adds words with similar meanings (car → automobile, vehicle) to capture different ways of expressing the same concept
- Thesaurus-based expansion leverages external knowledge sources (WordNet) to identify related terms and concepts
Relevance feedback can be used to expand queries based on user interactions (clicks, ratings) or explicit feedback
Pseudo-relevance feedback automatically expands queries using top-ranked documents assuming they are relevant
Query suggestion provides alternative or related queries to help users refine their search (autocomplete, "did you mean")
Query reformulation involves modifying the query based on user feedback, query performance analysis, or domain knowledge

Indexing and Search Algorithms

Indexing process converts raw data into a structured format optimized for efficient retrieval
Tokenization breaks text into individual words or tokens as the basic units for indexing
Normalization applies techniques like lowercasing, stemming, and stop word removal to standardize tokens
Inverted index is a key data structure that maps terms to the documents containing them enabling fast keyword search
- Term frequency (TF) represents how often a term appears in a document indicating its importance
- Inverse document frequency (IDF) measures the rarity of a term across the collection penalizing common words
Posting lists store the document IDs and term frequencies for each term in the inverted index
Index compression techniques (variable byte encoding, delta encoding) reduce the size of the index for efficient storage and retrieval
Search algorithms traverse the index to find documents matching the query and compute relevance scores
- Boolean retrieval matches documents that satisfy the logical conditions specified in the query
- Ranked retrieval assigns scores to documents based on their relevance to the query (TF-IDF, BM25, language models)

Evaluation Metrics

Precision measures the proportion of retrieved documents that are relevant to the query
- Formula: $\text{Precision} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of documents retrieved}}$
Recall measures the proportion of all relevant documents in the collection that are retrieved
- Formula: $\text{Recall} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of relevant documents in the collection}}$
F1-score is the harmonic mean of precision and recall providing a balanced measure of retrieval effectiveness
- Formula: $\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
Average Precision (AP) measures the average of precision values at each relevant document in the ranked list
Mean Average Precision (MAP) computes the mean of AP scores across a set of queries
Normalized Discounted Cumulative Gain (NDCG) evaluates the quality of the ranking considering the position and relevance of documents
- Discounted Cumulative Gain (DCG) penalizes highly relevant documents appearing lower in the ranking
- NDCG normalizes DCG by the ideal DCG to enable comparison across different queries
Precision at k (P@k) measures precision considering only the top-k retrieved documents
Recall at k (R@k) measures recall considering only the top-k retrieved documents

Question Answering Systems

Question analysis parses the user's question to determine the question type, focus, and expected answer format
Question types include factoid (who, what, when, where), list, definition, and complex questions requiring reasoning
Information retrieval component searches the document collection to find relevant passages or documents for answering the question
Passage retrieval identifies the most relevant text segments within documents that are likely to contain the answer
Named entity recognition (NER) identifies and classifies named entities (persons, organizations, locations) in the retrieved passages
Coreference resolution links mentions of the same entity across sentences and documents to establish context
Answer extraction selects the most promising answer candidates from the retrieved passages based on the question type and context
Answer ranking scores and ranks the answer candidates based on their relevance, completeness, and coherence
Answer generation formulates a natural language response by combining information from multiple sources and ensuring fluency

Advanced Techniques and Models

Query likelihood model estimates the probability of a document generating the query treating the query as a sample from a document language model
BM25 is a probabilistic retrieval model that incorporates term frequency, document length normalization, and term importance
Learning to rank (LTR) uses machine learning to optimize the ranking function based on relevance features and user feedback
- Pointwise approaches predict the relevance score of each document independently (regression, classification)
- Pairwise approaches learn to rank by comparing pairs of documents and their relative relevance (RankNet, LambdaRank)
- Listwise approaches directly optimize the entire ranked list considering position-based metrics (ListNet, LambdaMART)
Neural ranking models leverage deep learning architectures (CNN, RNN, Transformer) to learn semantic representations and matching patterns
- DRMM (Deep Relevance Matching Model) captures term importance and query-document relevance using a neural network
- BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model achieves state-of-the-art performance in various IR and QA tasks
Knowledge graphs represent entities and their relationships in a structured format enabling semantic search and reasoning
Open-domain question answering aims to answer questions from a large corpus without pre-defined domain restrictions

Practical Applications

Web search engines (Google, Bing) rely on information retrieval techniques to index and search billions of web pages
Enterprise search enables employees to find relevant documents, emails, and knowledge within an organization
E-commerce platforms (Amazon, eBay) use IR for product search, recommendation, and customer reviews analysis
Digital libraries and academic databases (Google Scholar, PubMed) facilitate the discovery and retrieval of scholarly literature
Personal assistants (Siri, Alexa) employ question answering to provide direct answers and perform tasks based on user queries
Chatbots and conversational agents use IR and QA to understand user intents and provide relevant responses
Legal and patent search helps lawyers and researchers find relevant cases, statutes, and patent documents
Social media monitoring analyzes user-generated content to track brand mentions, sentiment, and trending topics