Statistical Prediction Unit 10 Review: Unsupervised Learning (PCA & Clustering)

Unsupervised learning uncovers patterns in unlabeled data, with no predefined target variable. Its two main tasks are dimensionality reduction, which compresses high-dimensional data, and clustering, which groups similar data points. Principal Component Analysis (PCA) is a key dimensionality reduction method, while K-means and hierarchical clustering are popular clustering algorithms. Typical applications include customer segmentation, anomaly detection, and image compression.

Key Concepts

  • Unsupervised learning extracts patterns and insights from unlabeled data without predefined target variables
  • Dimensionality reduction techniques compress high-dimensional data into lower-dimensional representations while preserving important information
    • Helps mitigate the curse of dimensionality and improves computational efficiency
  • Clustering algorithms group similar data points together based on their inherent characteristics and patterns
    • Enables the discovery of hidden structures and relationships within the data
  • Principal Component Analysis (PCA) is a linear dimensionality reduction technique that identifies the directions of maximum variance in the data
  • K-means is a popular partitioning algorithm that splits the data into a fixed number of clusters, while hierarchical clustering builds a nested hierarchy of clusters
  • Evaluation metrics such as silhouette score and Davies-Bouldin index assess the quality and validity of clustering results
  • Real-world applications of unsupervised learning include customer segmentation, anomaly detection, and image compression

Dimensionality Reduction Techniques

  • Dimensionality reduction maps high-dimensional data to a lower-dimensional space while retaining the most important information
  • Reduces the number of features or variables in the dataset, making it more manageable and computationally efficient
  • Helps alleviate the curse of dimensionality, which refers to the challenges posed by high-dimensional data (sparsity, increased computational complexity)
  • Linear techniques, such as PCA, project the data onto a lower-dimensional linear subspace
    • PCA identifies the directions of maximum variance in the data and constructs new features (principal components) along those directions
  • Non-linear techniques, like t-SNE and UMAP, capture complex non-linear relationships in the data
    • Preserve the local structure and neighborhood relationships of the data points in the lower-dimensional space
  • Dimensionality reduction facilitates data visualization, noise reduction, and feature extraction

Principal Component Analysis (PCA)

  • PCA is a linear dimensionality reduction technique that transforms the data into a new coordinate system
  • Identifies the directions of maximum variance in the data, known as principal components
    • The first principal component captures the most variance, followed by the second, and so on
  • Projects the data onto the principal components to obtain a lower-dimensional representation
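  • Among all linear projections onto the same number of dimensions, PCA gives the smallest squared reconstruction error, so it preserves the global structure of the data better than fine local detail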
  • The number of principal components can be selected based on the desired level of variance retention (explained variance ratio)
  • PCA is sensitive to the scale of the features, so features are usually standardized (zero mean, unit variance) beforehand; otherwise features with large numeric ranges dominate the components
  • Limitations of PCA include its linearity assumption and inability to capture non-linear relationships
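
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `pca` and the parameter `variance_to_keep` are made up for this example, samples are assumed to be in rows and features in columns, and computing the components through an SVD of the standardized data is one common route among several.

```python
import numpy as np

def pca(X, variance_to_keep=0.95):
    """PCA via SVD of the standardized data matrix (rows = samples)."""
    X = np.asarray(X, dtype=float)
    # Standardize each feature, since PCA is sensitive to feature scale.
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant features unscaled to avoid dividing by zero
    Z = (X - mean) / std

    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained_variance = S**2 / (len(Z) - 1)
    ratio = explained_variance / explained_variance.sum()

    # Smallest number of components whose cumulative ratio reaches the target.
    k = int(np.searchsorted(np.cumsum(ratio), variance_to_keep) + 1)
    k = min(k, len(ratio))
    components = Vt[:k]
    scores = Z @ components.T  # lower-dimensional representation of the data
    return scores, components, ratio

# Hypothetical data: the second feature is almost a multiple of the first.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.05 * rng.normal(size=200), rng.normal(size=200)])
scores, components, ratio = pca(X, variance_to_keep=0.95)
print("explained variance ratio:", ratio.round(3))
print("components kept:", scores.shape[1])
```

Because two of the three features here carry nearly the same information, most of the variance should fall on the first two components, which is the situation where PCA loses little by dropping dimensions.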

Clustering Methods

  • Clustering is an unsupervised learning task that involves grouping similar data points together based on their inherent characteristics
  • Partitioning clustering algorithms, such as K-means, assign each data point to a single cluster
    • Aim to minimize the within-cluster variation and maximize the between-cluster separation
  • Hierarchical clustering algorithms, like agglomerative and divisive clustering, create a hierarchy of clusters
    • Agglomerative clustering starts with each data point as a separate cluster and iteratively merges the closest clusters
    • Divisive clustering starts with all data points in a single cluster and recursively splits the clusters
  • Density-based clustering, such as DBSCAN, identifies clusters as dense regions separated by areas of lower density
  • Model-based clustering, like Gaussian Mixture Models (GMM), assumes that the data is generated from a mixture of probability distributions
  • The choice of clustering algorithm depends on the nature of the data, desired cluster shape, and prior knowledge about the number of clusters

K-means Algorithm

  • K-means is a popular partitioning clustering algorithm that aims to partition the data into K clusters
  • Initializes K cluster centroids randomly or using techniques like K-means++
  • Iterates two steps:
    • Assignment: assigns each data point to the nearest centroid based on a distance metric (Euclidean distance)
    • Update: recomputes each centroid as the mean of the data points assigned to its cluster
  • Repeats the assignment and update steps until convergence (centroids no longer change significantly) or a maximum number of iterations is reached
  • The objective is to minimize the sum of squared distances between data points and their assigned centroids (within-cluster sum of squares)
  • The choice of K, the number of clusters, is a hyperparameter that needs to be specified in advance
    • Techniques like the elbow method or silhouette analysis can help determine an appropriate value for K
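  • K-means is computationally efficient and scales well to large datasets, but it converges only to a local optimum, so the result depends on the initial centroid positions; running several initializations and keeping the best one is common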
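
The assignment and update loop can be sketched as follows. This is a minimal NumPy version under simple assumptions: centroids are initialized from randomly chosen data points (not K-means++), and the function name `kmeans` and its parameters `n_iter`, `tol`, and `seed` are illustrative choices, not a library API.

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-8, seed=0):
    """Lloyd's algorithm with random initialization from the data points."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance from each point to each centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)

        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)

        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:  # converged: centroids no longer move
            break

    # Final assignment and within-cluster sum of squares for the returned centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wcss = d2[np.arange(len(X)), labels].sum()
    return labels, centroids, wcss

# Hypothetical data: two blobs. Print the best WCSS over a few restarts for each K.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(5, 0.5, size=(50, 2))])
for k in range(1, 5):
    best = min(kmeans(X, k, seed=s)[2] for s in range(5))
    print(k, round(best, 2))
```

Plotting the printed within-cluster sum of squares against K gives the curve used by the elbow method. The restarts matter because a single run can stop at a poor local optimum, depending on where the centroids start.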

Hierarchical Clustering

  • Hierarchical clustering creates a tree-like structure (dendrogram) that represents the hierarchical relationships between clusters
  • Agglomerative clustering is a bottom-up approach that starts with each data point as a separate cluster
    • Iteratively merges the closest clusters based on a linkage criterion (single, complete, average, or Ward's linkage)
    • Continues merging clusters until a desired number of clusters is reached or all data points belong to a single cluster
  • Divisive clustering is a top-down approach that starts with all data points in a single cluster
    • Recursively splits the clusters into smaller subsets based on a splitting criterion
    • Continues splitting until a desired number of clusters is reached or each data point forms its own cluster
  • The dendrogram visualizes the merging or splitting process and allows for the selection of the number of clusters by cutting the dendrogram at a specific height
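  • Hierarchical clustering does not require specifying the number of clusters in advance, but standard agglomerative algorithms work from the full pairwise distance matrix, so memory and running time grow at least quadratically with the number of data points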
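
The bottom-up merging process can be sketched directly. This is a deliberately naive NumPy version for small datasets: it covers single, complete, and average linkage only (Ward's linkage is left out), and the function name `agglomerative` and its return format are assumptions made for this example.

```python
import numpy as np

def agglomerative(X, n_clusters=1, linkage="average"):
    """Naive bottom-up clustering; roughly cubic time, so only for small n."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all points.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    link = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]

    clusters = [[i] for i in range(len(X))]  # every point starts as its own cluster
    merges = []  # (members of first cluster, members of second cluster, merge height)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Linkage distance between two clusters from point-level distances.
                d = link(D[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]

    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels, merges

# Hypothetical one-dimensional data: two tight pairs and one distant point.
X = np.array([[0.0], [1.0], [10.0], [11.0], [50.0]])
labels, merges = agglomerative(X, n_clusters=2, linkage="single")
print(labels)
for left, right, height in merges:
    print(left, "+", right, "at height", height)
```

The recorded merge heights are the information a dendrogram draws. Running with `n_clusters=1` records every merge, and cutting at a chosen height corresponds to undoing all merges above it.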

Evaluation Metrics

  • Evaluation metrics assess the quality and validity of clustering results, often in the absence of ground truth labels
  • Internal evaluation metrics measure the compactness and separation of clusters based on the data itself
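    • Silhouette coefficient: each data point gets a silhouette score s = (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster; scores range from -1 to 1, and the coefficient is the average over all points (higher is better)
    • Davies-Bouldin index averages, over all clusters, the worst-case ratio of within-cluster scatter to between-centroid distance, with lower values indicating better clustering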
  • External evaluation metrics compare the clustering results to external ground truth labels, if available
    • Adjusted Rand Index (ARI) measures the similarity between the clustering and the ground truth, accounting for chance agreements
    • Normalized Mutual Information (NMI) quantifies the mutual information between the clustering and the ground truth, normalized by the entropies of the two partitions so that values fall between 0 and 1
  • Visualization techniques, such as scatter plots or t-SNE embeddings, can provide qualitative insights into the clustering results
  • The choice of evaluation metric depends on the specific goals and characteristics of the clustering task
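
One internal and one external metric from the list above can be computed by hand, which makes the definitions concrete. This is a sketch under stated assumptions: Euclidean distance, at least two clusters, no duplicate points, and a score of 0 for singleton clusters (a common convention). The function names `silhouette_coefficient` and `adjusted_rand_index` are chosen for this example.

```python
import numpy as np
from collections import Counter
from math import comb

def silhouette_coefficient(X, labels):
    """Mean of s = (b - a) / max(a, b) over all points (needs >= 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # convention assumed here: a singleton cluster scores 0
        # a: mean distance to the other members of the point's own cluster.
        a = D[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

def adjusted_rand_index(labels_true, labels_pred):
    """ARI computed from pair counts in the contingency table of two labelings."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in Counter(labels_true).values())
    sum_cols = sum(comb(v, 2) for v in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical data: two pairs of nearby points, labeled well and badly.
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
good, bad = [0, 0, 1, 1], [0, 1, 0, 1]
print(silhouette_coefficient(X, good), silhouette_coefficient(X, bad))
print(adjusted_rand_index(good, [1, 1, 0, 0]), adjusted_rand_index(good, bad))
```

The labeling that matches the visible grouping should score close to 1 on the silhouette coefficient and the mismatched labeling should score below 0. The ARI ignores label names, so swapping 0 and 1 still counts as perfect agreement.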

Real-world Applications

  • Customer segmentation in marketing divides customers into distinct groups based on their behavior, preferences, or demographics
    • Enables targeted marketing strategies and personalized recommendations
  • Anomaly detection flags instances that deviate from the patterns most of the data follows
    • Detects fraudulent transactions, network intrusions, or manufacturing defects
  • Image compression reduces the size of images by identifying and preserving the most important visual information
    • PCA can be used to compress images by projecting them onto a lower-dimensional space
  • Document clustering organizes a large collection of text documents into coherent groups based on their content similarity
    • Facilitates information retrieval, topic modeling, and content recommendation systems
  • Bioinformatics applies clustering techniques to analyze gene expression data, identify co-expressed genes, and discover biological pathways
  • Social network analysis uses clustering to identify communities or groups of closely connected individuals within a social network
  • Recommender systems employ clustering to group users or items with similar preferences and generate personalized recommendations
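
The image compression application can be illustrated with a small sketch. It assumes a 2-D grayscale image stored as a NumPy array, treats each row of pixels as one sample, and keeps only the top k principal components. The function names `compress_image` and `decompress_image` and the synthetic test image are made up for this example; real image formats use other techniques.

```python
import numpy as np

def compress_image(image, k):
    """Rank-k PCA approximation of a grayscale image (rows treated as samples)."""
    image = np.asarray(image, dtype=float)
    mean = image.mean(axis=0)  # per-column mean, removed before the SVD
    U, S, Vt = np.linalg.svd(image - mean, full_matrices=False)
    scores = U[:, :k] * S[:k]  # coordinates of each row along the top-k components
    components = Vt[:k]        # the k principal directions
    return scores, components, mean

def decompress_image(scores, components, mean):
    """Rebuild an approximate image from the stored pieces."""
    return scores @ components + mean

# Synthetic 64x64 "image": a smooth pattern plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 64)
image = np.outer(np.sin(4 * x), np.cos(3 * x)) + 0.01 * rng.normal(size=(64, 64))
for k in (1, 5, 20):
    scores, components, mean = compress_image(image, k)
    approx = decompress_image(scores, components, mean)
    stored = scores.size + components.size + mean.size  # numbers kept instead of 64 * 64
    error = np.sqrt(np.mean((image - approx) ** 2))
    print(k, stored, error)
```

Increasing k stores more numbers and lowers the reconstruction error, which is the same trade-off as choosing how much variance to retain in PCA.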