Natural Language Processing Unit 11 – Information Retrieval & Question Answering

Information retrieval and question answering are crucial components of natural language processing. These fields focus on finding relevant information from large collections of data and providing direct answers to user queries. IR techniques like indexing, query processing, and relevance ranking enable efficient searching. QA systems analyze questions, retrieve relevant passages, and extract or generate answers. Advanced models and evaluation metrics continually improve these systems' performance.

Key Concepts

  • Information retrieval (IR) focuses on finding relevant information from large collections of unstructured data (documents, web pages, images, videos)
  • Query processing involves understanding and transforming user queries into a format suitable for searching
  • Indexing organizes and structures data for efficient retrieval (inverted index, forward index)
    • Inverted index maps terms to the documents containing them, enabling fast keyword-based search
    • Forward index maps each document to the terms it contains, supporting document-oriented processing and more advanced search capabilities
  • Relevance ranking determines the order of search results based on their relevance to the query (TF-IDF, BM25)
  • Evaluation metrics assess the effectiveness of IR systems (precision, recall, F1-score, MAP, NDCG)
  • Question answering (QA) aims to provide direct answers to user questions by understanding the intent and context
  • Advanced techniques improve IR and QA performance (query expansion, semantic search, neural ranking models)

Information Retrieval Basics

  • Information need represents the user's underlying goal or purpose for searching
  • Document collection refers to the set of documents or data sources being searched (web pages, books, articles)
  • Retrieval models define how documents are matched and ranked against queries (Boolean, vector space, probabilistic)
    • Boolean model uses logical operators (AND, OR, NOT) to match documents based on the presence or absence of query terms
    • Vector space model represents documents and queries as vectors in a high-dimensional space and measures their similarity (cosine similarity)
  • Relevance feedback incorporates user feedback to refine search results and improve query understanding
  • Stemming reduces words to their base or root form (running, runs, ran → run) to improve matching and recall
  • Stop words are common words (the, and, of) that are often removed during indexing and searching to improve efficiency
  • Tokenization breaks text into individual words or tokens for further processing and analysis
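Preprocessing steps like these can be sketched in a few lines of Python (the stop-word list and suffix-stripping rules below are toy illustrations; production systems typically use a full stemmer such as Porter's):

```python
import re

STOP_WORDS = {"the", "and", "of", "a", "to", "in"}  # small illustrative list

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def simple_stem(token):
    """Crude suffix stripping -- a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, and stem, as described above."""
    return [simple_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("The runner was running and runs daily"))
```

Note how crude stemming can over- or under-strip ("running" becomes "runn" here), which is why practical systems rely on established algorithms like Porter or Snowball.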

Query Processing and Expansion

  • Query understanding involves analyzing the user's query to determine the intent, context, and key information needs
  • Query normalization standardizes the query by converting to lowercase, removing punctuation, and handling special characters
  • Query expansion techniques add related terms to the original query to improve recall and coverage
    • Synonym expansion adds words with similar meanings (car → automobile, vehicle) to capture different ways of expressing the same concept
    • Thesaurus-based expansion leverages external knowledge sources (WordNet) to identify related terms and concepts
  • Relevance feedback can be used to expand queries based on user interactions (clicks, ratings) or explicit feedback
  • Pseudo-relevance feedback automatically expands queries using top-ranked documents assuming they are relevant
  • Query suggestion provides alternative or related queries to help users refine their search (autocomplete, "did you mean")
  • Query reformulation involves modifying the query based on user feedback, query performance analysis, or domain knowledge
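Synonym expansion can be sketched with a lookup table (the table below is hypothetical; a real system would draw related terms from WordNet or learned embeddings):

```python
# Hypothetical synonym table standing in for a thesaurus like WordNet.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "buy": ["purchase"],
}

def expand_query(terms, synonyms=SYNONYMS):
    """Add related terms to the query to improve recall and coverage."""
    expanded = list(terms)
    for term in terms:
        for alt in synonyms.get(term, []):
            if alt not in expanded:
                expanded.append(alt)
    return expanded

print(expand_query(["buy", "car"]))
```

The trade-off sketched here is the usual one: expansion raises recall but can hurt precision if the added terms shift the query's meaning.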

Indexing and Search Algorithms

  • Indexing process converts raw data into a structured format optimized for efficient retrieval
  • Tokenization breaks text into individual words or tokens as the basic units for indexing
  • Normalization applies techniques like lowercasing, stemming, and stop word removal to standardize tokens
  • Inverted index is a key data structure that maps terms to the documents containing them, enabling fast keyword search
    • Term frequency (TF) represents how often a term appears in a document, indicating its importance within that document
    • Inverse document frequency (IDF) measures the rarity of a term across the collection, penalizing common words
  • Posting lists store the document IDs and term frequencies for each term in the inverted index
  • Index compression techniques (variable byte encoding, delta encoding) reduce the size of the index for efficient storage and retrieval
  • Search algorithms traverse the index to find documents matching the query and compute relevance scores
    • Boolean retrieval matches documents that satisfy the logical conditions specified in the query
    • Ranked retrieval assigns scores to documents based on their relevance to the query (TF-IDF, BM25, language models)
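The inverted-index and ranked-retrieval ideas above fit together in a short sketch (toy corpus; TF-IDF weighting stands in here for a fuller scoring model such as BM25):

```python
import math
from collections import Counter, defaultdict

docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "dogs and cats make good pets",
}

# Build the inverted index: term -> {doc_id: term frequency}.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf

def tfidf_search(query):
    """Rank documents by summed TF-IDF over the query terms."""
    n = len(docs)
    scores = Counter()
    for term in query.split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]

print(tfidf_search("cat mat"))
```

Each posting list here holds document IDs and term frequencies, exactly the information the scoring loop needs, which is why the inverted index is the workhorse structure of search engines.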

Evaluation Metrics

  • Precision measures the proportion of retrieved documents that are relevant to the query
    • Formula: $\text{Precision} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of documents retrieved}}$
  • Recall measures the proportion of all relevant documents in the collection that are retrieved
    • Formula: $\text{Recall} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of relevant documents in the collection}}$
  • F1-score is the harmonic mean of precision and recall, providing a balanced measure of retrieval effectiveness
    • Formula: $\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
  • Average Precision (AP) measures the average of precision values at each relevant document in the ranked list
  • Mean Average Precision (MAP) computes the mean of AP scores across a set of queries
  • Normalized Discounted Cumulative Gain (NDCG) evaluates the quality of the ranking considering the position and relevance of documents
    • Discounted Cumulative Gain (DCG) applies a position-based discount, so highly relevant documents contribute less when they appear lower in the ranking
    • NDCG normalizes DCG by the ideal DCG to enable comparison across different queries
  • Precision at k (P@k) measures precision considering only the top-k retrieved documents
  • Recall at k (R@k) measures recall considering only the top-k retrieved documents
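These metrics are straightforward to compute from a ranked list and a set of relevance judgments; a sketch (the document IDs and judgments below are illustrative):

```python
import math

def precision_at_k(ranked, relevant, k):
    """P@k: fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values taken at each relevant document's rank."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def ndcg(ranked, gains, k):
    """DCG with a log2 position discount, normalized by the ideal ordering."""
    dcg = sum(gains.get(d, 0) / math.log2(r + 1)
              for r, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(r + 1) for r, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(average_precision(ranked, relevant))
```

Averaging `average_precision` over a query set gives MAP, as defined above.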

Question Answering Systems

  • Question analysis parses the user's question to determine the question type, focus, and expected answer format
  • Question types include factoid (who, what, when, where), list, definition, and complex questions requiring reasoning
  • Information retrieval component searches the document collection to find relevant passages or documents for answering the question
  • Passage retrieval identifies the most relevant text segments within documents that are likely to contain the answer
  • Named entity recognition (NER) identifies and classifies named entities (persons, organizations, locations) in the retrieved passages
  • Coreference resolution links mentions of the same entity across sentences and documents to establish context
  • Answer extraction selects the most promising answer candidates from the retrieved passages based on the question type and context
  • Answer ranking scores and ranks the answer candidates based on their relevance, completeness, and coherence
  • Answer generation formulates a natural language response by combining information from multiple sources and ensuring fluency
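A deliberately crude passage-retrieval sketch, scoring candidate passages by word overlap with the question (real QA systems apply the richer question analysis, NER, and answer extraction described above):

```python
def best_passage(question, passages):
    """Pick the passage with the greatest word overlap with the question --
    a toy stand-in for a real passage-retrieval scoring model."""
    q_terms = set(question.lower().split())

    def overlap(passage):
        return len(q_terms & set(passage.lower().split()))

    return max(passages, key=overlap)

passages = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
]
print(best_passage("What is the capital of France?", passages))
```

An answer-extraction stage would then pull the entity matching the expected answer type ("Paris" for a where/what-capital question) from the selected passage.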

Advanced Techniques and Models

  • Query likelihood model estimates the probability of a document generating the query, treating the query as a sample from that document's language model
  • BM25 is a probabilistic retrieval model that incorporates term frequency, document length normalization, and term importance
  • Learning to rank (LTR) uses machine learning to optimize the ranking function based on relevance features and user feedback
    • Pointwise approaches predict the relevance score of each document independently (regression, classification)
    • Pairwise approaches learn to rank by comparing pairs of documents and their relative relevance (RankNet, LambdaRank)
    • Listwise approaches directly optimize the entire ranked list considering position-based metrics (ListNet, LambdaMART)
  • Neural ranking models leverage deep learning architectures (CNN, RNN, Transformer) to learn semantic representations and matching patterns
    • DRMM (Deep Relevance Matching Model) captures term importance and query-document relevance using a neural network
    • BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model that achieves state-of-the-art performance on various IR and QA tasks
  • Knowledge graphs represent entities and their relationships in a structured format, enabling semantic search and reasoning
  • Open-domain question answering aims to answer questions from a large corpus without pre-defined domain restrictions
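BM25's combination of saturating term frequency, IDF, and document-length normalization can be sketched over a toy corpus (k1 and b are the standard free parameters; this IDF variant adds 1 inside the log to keep scores non-negative):

```python
import math
from collections import Counter

docs = [
    "the quick brown fox",
    "the lazy dog sleeps",
    "the quick dog runs fast",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N          # average document length
df = Counter(t for d in tokenized for t in set(d))  # document frequencies

def bm25(query, doc, k1=1.5, b=0.75):
    """BM25 score: IDF times a saturating, length-normalized TF term."""
    tf = Counter(doc)
    dl = len(doc)
    score = 0.0
    for term in query.split():
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * dl / avgdl))
        score += idf * norm
    return score

ranked = sorted(range(N), key=lambda i: bm25("quick dog", tokenized[i]),
                reverse=True)
print(ranked)
```

The third document matches both query terms, so it outranks the single-term matches; raising b penalizes long documents more aggressively, while k1 controls how quickly repeated terms stop adding score.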

Practical Applications

  • Web search engines (Google, Bing) rely on information retrieval techniques to index and search billions of web pages
  • Enterprise search enables employees to find relevant documents, emails, and knowledge within an organization
  • E-commerce platforms (Amazon, eBay) use IR for product search, recommendation, and customer review analysis
  • Digital libraries and academic databases (Google Scholar, PubMed) facilitate the discovery and retrieval of scholarly literature
  • Personal assistants (Siri, Alexa) employ question answering to provide direct answers and perform tasks based on user queries
  • Chatbots and conversational agents use IR and QA to understand user intents and provide relevant responses
  • Legal and patent search helps lawyers and researchers find relevant cases, statutes, and patent documents
  • Social media monitoring analyzes user-generated content to track brand mentions, sentiment, and trending topics


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.
