(S (NP Michael) (VP ate (NP the fish)) .)
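Subtree-indicator features like the ones that follow can be read directly off a bracketed parse. A minimal stdlib-only Python sketch, with one binary feature per depth-one production (a simplification of richer tree-fragment features; `parse` and `production_features` are illustrative names, not library functions):

```python
import re

def parse(sexpr):
    """Parse a bracketed tree like '(S (NP Michael) (VP ate))' into nested lists."""
    tokens = re.findall(r"\(|\)|[^()\s]+", sexpr)
    def walk(i):
        # tokens[i] == "(", tokens[i+1] is the node label
        label, children, i = tokens[i + 1], [], i + 2
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = walk(i)
            else:
                child, i = tokens[i], i + 1
            children.append(child)
        return [label] + children, i + 1
    tree, _ = walk(0)
    return tree

def production_features(tree, feats=None):
    """One binary feature per internal node: its label plus its children's labels."""
    if feats is None:
        feats = {}
    if isinstance(tree, list):
        kids = [c[0] if isinstance(c, list) else c for c in tree[1:]]
        feats["(%s %s)" % (tree[0], " ".join(kids))] = 1.0
        for c in tree[1:]:
            production_features(c, feats)
    return feats

t = parse("(S (NP Michael) (VP ate (NP the fish)) .)")
print(production_features(t))
```

Lexicalized variants (keeping the head word, as in `(NP_Michael)`) would extend the same idea.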

(S(NP)(VP(NP)))=1.0\ (S(NP\_Michael)(VP(NP)))=1.0

- Matt Post and Shane Bergsma. “Explicit and Implicit Syntactic Features for Text Classification”

(S (VP Do not (VP tell (NP them) (SBAR (S (NP our products) (VP contain (NP asbestos)))))) .)

(S (VP Tell (NP them) (SBAR (S (NP our products) (VP do not (VP contain (NP asbestos)))))) .)

(VP contain (NP asbestos))=1.0\ (VP do not (VP contain (NP asbestos)))=1.0

## Overfitting

- With, say, 100s of documents and 10,000s of features, overfitting is easy.
- E.g., for binary classification with a linear separator, if every document has a unique word, give $+$ weight to those words for $+$ documents and $-$ weight to those for $-$ documents. Perfect fit!
- For $n$ documents, the linear classifier would need only $n$ of the 10,000s of weights to be nonzero.
- Suppose features are coded $\pm 1$. In an SVM, each such weight would have to be $\pm 1$ to satisfy $y_i\mathbf{w}^T\mathbf{x}_i \ge 1$; the norm of $\mathbf{w}$ is $\sqrt{n}$.
- If *one* word can discriminate, we can instead use a weight vector with $\|\mathbf{w}\| = 1$.
- SVMs prefer sparse, simple solutions, and can avoid overfitting even when $p \gg n$.

# Clustering

## What is clustering?

- Clustering is grouping similar objects together.
  - To establish prototypes, or detect outliers.
  - To simplify data for further analysis/learning.
  - To visualize data (in conjunction with dimensionality reduction).
- Clusterings are usually not "right" or "wrong": different clusterings/clustering criteria can reveal different things about the data.

## Clustering Algorithms

- Clustering algorithms:
  - Employ some notion of distance between objects
  - Have an explicit or implicit criterion defining what a good cluster is
  - Heuristically optimize that criterion to determine the clustering
- Some clustering criteria/algorithms have natural probabilistic interpretations

## $K$-means clustering

\newcommand{\x}{\mathbf{x}}
\newcommand{\v}{\mathbf{v}}
\newcommand{\b}{\mathbf{b}}

- One of the most commonly used clustering algorithms, because it is easy to implement and quick to run.
- Assumes the objects (instances) to be clustered are $p$-dimensional vectors, $\x_i$.
- Uses a distance measure between the instances (typically Euclidean distance)
- The goal is to *partition* the data into $K$ disjoint subsets

## $K$-means clustering

- Inputs:
  - A set of $p$-dimensional real vectors $\{\x_1, \x_2, \ldots, \x_n\}$.
  - $K$, the desired number of clusters.
- Output: A mapping of the vectors into $K$ clusters (disjoint subsets), $C:\{1,\ldots,n\}\mapsto\{1,\ldots,K\}$.

1. Initialize $C$ randomly.
2. Repeat:
    1. Compute the *centroid* of each cluster (the mean of all the instances in the cluster)
    2. Reassign each instance to the cluster with the closest centroid

   until $C$ stops changing.

## Example: initial data

## Example: assign into 3 clusters randomly

## Example: compute centroids

## Example: reassign clusters

## Example: recompute centroids

## Example: reassign clusters

## Example: recompute centroids – done!

## What is the right number of clusters?

## Example: assign into 4 clusters randomly

## Example: compute centroids

## Example: reassign clusters

## Example: recompute centroids

## Example: reassign clusters

## Example: recompute centroids – done!

## Assessing the quality of the clustering

- If used as a pre-processing step for supervised learning, measure the performance of the supervised learner
- Measure the "tightness" of the clusters: points in the same cluster should be close together, points in different clusters should be far apart
- Tightness can be measured by the minimum, maximum, or average distance between points
- The **silhouette criterion** is sometimes used
- Problem: these measures usually favour large numbers of clusters, so some form of complexity penalty is necessary

## Typical applications of clustering

- Pre-processing step for supervised learning
- Data inspection / exploratory data analysis
- Discretizing real-valued variables into non-uniform buckets
- Data compression

## Questions

- Will $K$-means terminate?
- Will it always find the same answer?
- How should we choose the initial cluster centers?
- Can we automatically choose the number of centers?

## Does $K$-means clustering terminate?

- For given data $\{\x_1,\ldots,\x_n\}$ and a clustering $C$, consider the sum of the squared Euclidean distances between each vector and the centroid of its cluster:
  $$J = \sum_{i=1}^n\|\x_i-\mu_{C(i)}\|^2~,$$
  where $\mu_{C(i)}$ denotes the centroid of the cluster containing $\x_i$.
- There are finitely many possible clusterings: at most $K^n$.
- Each time we reassign a vector to a cluster with a nearer centroid, $J$ decreases.
- Each time we recompute the centroids of each cluster, $J$ decreases (or stays the same).
- Thus, the algorithm must terminate.

## Does $K$-means always find the same answer?

- $K$-means is a version of coordinate descent, where the parameters are the cluster center coordinates and the assignments of points to clusters.
- It minimizes the sum of squared Euclidean distances from vectors to their cluster centroid.
- This error function has many local minima!
- The solution found is *locally optimal*, but *not globally optimal*.
- Because the solution depends on the initial assignment of instances to clusters, random restarts will give different solutions.

## Example - Same problem, different solutions

Two runs from different initializations: $J=0.22870$ (left) vs. $J=0.3088$ (right).

## Choosing the number of clusters

- A difficult problem. Some heuristics:
  - Delete clusters that cover too few points
  - Split clusters that cover too many points
  - Add extra clusters for "outliers"
  - Add the option to belong to "no cluster"
  - Minimum description length: minimize loss + complexity of the clustering
  - Use a hierarchical method first

## Why Euclidean distance?

Subjective reason: It produces nice, round clusters.

![image](images/Cluster_K4_Step7_WithEllipses.pdf.svg)

## Why *not* Euclidean distance?

1. It produces nice round clusters!
2.
Differently scaled axes can dramatically affect results.
3. There may be symbolic attributes, which have to be treated differently.

## Agglomerative clustering

- Input: Pairwise distances $d(\x,{\bf x'})$ between a set of data objects $\{\x_i\}$.
- Output: A hierarchical clustering

1. Assign each instance as its own cluster on a working list $W$.
2. Repeat:
    1. Find the two clusters in $W$ that are most "similar".
    2. Remove them from $W$.
    3. Add their union to $W$.

   until $W$ contains a single cluster with all the data objects.
3. Return *all clusters* appearing in $W$ at any stage of the algorithm.

## How many clusters after iteration $k$?

- Answer: $n-k$, where $n$ is the number of data objects.
- Why?
  - The working list $W$ starts with $n$ singleton clusters
  - Each iteration removes two clusters from $W$ and adds one new cluster
  - The algorithm stops when $W$ has one cluster, which is after $k = n-1$ iterations

## How do we measure dissimilarity between clusters?

- Distance between nearest objects ("single-linkage" agglomerative clustering, or "nearest neighbor"):
  $$\min_{\x\in C, {\bf x'}\in C'} d(\x,{\bf x'})$$
- Average distance between objects ("group-average" agglomerative clustering):
  $$\frac{1}{|C||C'|}\sum_{\x\in C, {\bf x'}\in C'} d(\x,{\bf x'})$$
- Distance between farthest objects ("complete-linkage" agglomerative clustering, or "furthest neighbor"):
  $$\max_{\x\in C, {\bf x'}\in C'} d(\x,{\bf x'})$$

## Example 1: Data

(panels: single, average, and complete linkage)

## Example 1: Iteration 30

## Example 1: Iteration 60

## Example 1: Iteration 70

## Example 1: Iteration 78

## Example 1: Iteration 79

## Example 2: Data

## Example 2: Iteration 50

## Example 2: Iteration 80

## Example 2: Iteration 90

## Example 2: Iteration 95

## Example 2: Iteration 99

## Intuitions about cluster similarity

- Single-linkage
  - Favors spatially extended / filamentous clusters
  - Often leaves singleton clusters until near the end
- Complete-linkage favors compact clusters
- Average-linkage is somewhere in between

## Summary of $K$-means clustering

- Fast way of partitioning data into $K$ clusters
- It minimizes the sum of squared Euclidean distances to the cluster centroids
- Different clusterings can result from different initializations
- Natural way to add new points to existing clusters

## Summary of Hierarchical clustering

- Organizes data objects into a tree based on similarity.
- Agglomerative (bottom-up) tree construction is most popular.
- There are several choices of distance metric (linkage criterion)
- No natural way to decide which cluster a new point should belong to

## Hierarchical Clustering

```{r echo=F,message=F,warning=F}
library(ggplot2)
```

## Hierarchical Clustering

```{r}
# cluster on the two raw columns only (the clust column added below must
# not be fed back into dist() in later chunks)
fclust <- hclust(dist(faithful[, c("eruptions", "waiting")]), "average")
faithful$clust <- as.factor(cutree(fclust, 2))
ggplot(faithful, aes(x=eruptions, y=waiting, colour=clust, shape=clust)) + geom_point()
```

## Hierarchical Clustering

```{r}
fclust <- hclust(dist(faithful[, c("eruptions", "waiting")]), "average")
faithful$clust <- as.factor(cutree(fclust, 5))
ggplot(faithful, aes(x=eruptions, y=waiting, colour=clust, shape=clust)) + geom_point()
```

## Hierarchical Clustering

```{r warning=F}
fclust <- hclust(dist(faithful[, c("eruptions", "waiting")]), "average")
faithful$clust <- as.factor(cutree(fclust, 10))
ggplot(faithful, aes(x=eruptions, y=waiting, colour=clust)) + geom_point()
```

## Scale matters!

```{r}
sfaithful <- scale(faithful[, c("eruptions", "waiting")])
head(sfaithful, 4)
attr(sfaithful, "scaled:center")
attr(sfaithful, "scaled:scale")
```

## Hierarchical Clustering

```{r warning=F}
fclust <- hclust(dist(sfaithful), "average")
faithful$clust <- as.factor(cutree(fclust, 10))
ggplot(faithful, aes(x=eruptions, y=waiting, colour=clust)) + geom_point()
```

## Hierarchical Clustering

```{r}
fclust <- hclust(dist(sfaithful), "average")
faithful$clust <- as.factor(cutree(fclust, 5))
ggplot(faithful, aes(x=eruptions, y=waiting, colour=clust, shape=clust)) + geom_point()
```

## Hierarchical Clustering

```{r}
fclust <- hclust(dist(sfaithful), "average")
faithful$clust <- as.factor(cutree(fclust, 2))
ggplot(faithful, aes(x=eruptions, y=waiting, colour=clust, shape=clust)) + geom_point()
```

## Dendrogram

```{r warning=F}
library(ggdendro)
ggdendrogram(fclust, leaf_labels=F, labels=F)
```

# Dimensionality reduction

## Dimensionality Reduction

- Dimensionality reduction (or embedding) techniques:
  - Assign instances to real-valued vectors in a space that is much lower-dimensional (even 2D or 3D for
visualization).
  - Approximately preserve similarity/distance relationships between instances
  - Sometimes, retain the ability to (approximately) reconstruct the original instances
- Clustering can be thought of this way

## Dimensionality Reduction Techniques

- Axis-aligned: "feature selection" (see the Guyon video on the wiki, and other materials)
- Linear: **principal components analysis**
- Non-linear:
  - Kernel PCA
  - Independent components analysis
  - Self-organizing maps
  - Multi-dimensional scaling
  - **t-SNE: t-distributed Stochastic Neighbour Embedding**

## "True dimensionality" of this dataset?

## "True dimensionality" of this dataset?

- You may give me a model with $\ll n$ parameters ahead of time.
- How many additional numbers must you send to tell me approximately where a particular data point is?

## "True dimensionality" of this dataset?

## "True dimensionality" of this dataset?

## "True dimensionality" of this dataset?

## Remarks

- All dimensionality reduction techniques rest on the implicit assumption that the data lies near some *low-dimensional manifold*
- This is the case for the first three examples, which lie along a 1-dimensional manifold despite being plotted in 2D
- In the last example, the data has been generated randomly in 2D, so no dimensionality reduction is possible without losing a lot of information

## Principal Component Analysis (PCA)

- Given: $n$ instances, each a length-$p$ real vector.
- Suppose we want a 1-dimensional representation of the data, instead of a $p$-dimensional one.
- Specifically, we will:
  - Choose a line in ${\mathbb{R}}^{p}$ that "best represents" the data.
  - Assign each data object to a point along that line.
- Identifying a point on a line requires just one scalar: how far along the line is the point?

## Which line is best?

## How do we assign points to lines?
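With $\|\v\|=1$, a point $\x$ is assigned to the line by orthogonal projection: the scalar is $\alpha = \v\cdot(\x-{\bf b})$, and the reconstruction is $\hat\x={\bf b}+\alpha\v$. A stdlib-only Python sketch (`project` is an illustrative name, and the numbers are made up):

```python
def project(x, b, v):
    """Project x onto the line b + alpha*v; v is assumed to be unit-length.

    Returns alpha = v . (x - b) and the reconstruction xhat = b + alpha*v.
    """
    alpha = sum(vi * (xi - bi) for xi, bi, vi in zip(x, b, v))
    xhat = [bi + alpha * vi for bi, vi in zip(b, v)]
    return alpha, xhat

alpha, xhat = project(x=[2.0, 1.0], b=[0.0, 0.0], v=[1.0, 0.0])
print(alpha, xhat)  # alpha = 2.0, reconstruction [2.0, 0.0]
```

This $\alpha$ minimizes $\|\x-({\bf b}+\alpha\v)\|^2$ over all scalars, so the residual $\x-\hat\x$ is orthogonal to $\v$.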
## Reconstruction error

- Let the line be represented as ${\bf b}+\alpha \v$ for ${\bf b},\v\in{\mathbb{R}}^p$, $\alpha\in{\mathbb{R}}$. For convenience, assume $\|\v\|=1$.
- Each instance $\x_i$ is associated with a point on the line, $\hat\x_i={\bf b}+\alpha_i\v$.
- Instance $\x_i$ is *encoded* as the scalar $\alpha_i$.

## Example data

## Example with $\v \propto (1,0.3)$

## Example with $\v \propto (1,-0.3)$

## Minimizing reconstruction error

- We want to choose ${\bf b}$, $\v$, and the $\alpha_i$ to minimize the total reconstruction error over all data points, measured using Euclidean distance:
  $$R=\sum_{i=1}^n\|\x_i-\hat\x_i\|^2$$
- That is:
  $$\min_{{\bf b},\,\v,\,\alpha_1,\dots,\alpha_n}\ \sum_{i=1}^n\|\x_i-({\bf b} + \alpha_i \v)\|^2 \quad \text{s.t. } \|\v\|^2=1$$

## Solving the PCA optimization problem [HTF Ch. 14.5]

$$\min_{{\bf b},\,\v,\,\alpha_1,\dots,\alpha_n}\ \sum_{i=1}^n\|\x_i-({\bf b} + \alpha_i \v)\|^2 \quad \text{s.t. } \|\v\|^2=1$$

- It turns out the optimal $\mathbf{b}$ is just the sample mean of the data, ${\bf b}=\frac{1}{n}\sum_{i=1}^n \x_i$.
- This means that the best line goes through the mean of the data. Typically, we subtract the mean first. Assuming it is zero:
  $$\min_{\v,\,\alpha_1,\dots,\alpha_n}\ \sum_{i=1}^n\|\x_i-\alpha_i \v\|^2 \quad \text{s.t. } \|\v\|^2=1$$
- Now consider fixing ${\mathbf{v}}$. The optimal $\alpha_i$ is given by *projecting* $\x_i$ onto $\mathbf{v}$.

## Example with optimal line: ${\bf b}=(0.54,0.52)$, $\v\propto(1,0.45)$

## Reduction to $d$ dimensions

- ${\bf b}$, $\v$, and the $\alpha_i$ can be computed easily in polynomial time. The $\alpha_i$ give a 1D representation.
- More generally, we can create a $d$-dimensional representation of our data by projecting the instances onto a hyperplane ${\bf b}+\alpha^1\v_1+\ldots+\alpha^d\v_d$.
- If we assume the $\v_j$ are of unit length and mutually orthogonal, then the optimal choices are:
  - ${\bf b}$ is the mean of the data (as before).
  - The $\v_j$ are orthogonal eigenvectors of $X^\top X$ corresponding to its $d$ largest eigenvalues.
  - Each instance is projected orthogonally onto the hyperplane.

## Singular Value Decomposition

- ${\bf b}$, the eigenvalues $\lambda$, the $\v_j$, and the projections of the instances can all be computed in polynomial time, e.g. using the (thin) Singular Value Decomposition:
  $$X_{n\times p} = {\color{MyRed}U_{n\times p}} {\color{MyBlue}D_{p \times p}} {\color{MyGreen}V_{p \times p}^\top}$$
- Columns of $U$ are the left singular vectors, the diagonal of $D$ holds the singular values (square roots of the eigenvalues of $X^\top X$), and columns of $V$ are the right singular vectors (the $\v_j$).
- Typically $D$ is sorted by magnitude. ***The first $d$ columns of $U$ are the new representation of $X$.***

\newcommand{\u}{\mathbf{u}}

- To encode a new column vector $\x$ as a vector $\u$: compute $\u = D^{-1} V^\top \x$ and take the first $d$ elements.
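In practice one calls an SVD routine; for intuition, the leading eigenvector of $X^\top X$ for mean-centered data (the first principal component direction) can also be found by simple power iteration. A stdlib-only Python sketch on made-up toy data (`first_pc` is an illustrative name, not a library function):

```python
import math

def first_pc(X, iters=100):
    """Power iteration on C = Xc^T Xc, where Xc is the mean-centered data.

    Repeatedly applies C to a vector and renormalizes; the iterates converge
    to the eigenvector of the largest eigenvalue (the first PC direction).
    """
    n, p = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - mean[j] for j in range(p)] for row in X]
    v = [1.0] * p  # deterministic start; must not be orthogonal to the answer
    for _ in range(iters):
        Xv = [sum(row[j] * v[j] for j in range(p)) for row in Xc]        # X v
        w = [sum(Xc[i][j] * Xv[i] for i in range(n)) for j in range(p)]  # X^T (X v)
        norm = math.sqrt(sum(wj * wj for wj in w))
        v = [wj / norm for wj in w]
    return v

# toy data lying exactly on a line of slope 0.45, matching the
# direction v ∝ (1, 0.45) from the optimal-line example above
pts = [[t, 0.45 * t] for t in [0.0, 1.0, 2.0, 3.0]]
print(first_pc(pts))
```

Deflation (subtracting the recovered component and iterating again) would yield subsequent $\v_j$, though a library SVD is the sensible choice for real data.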
## Eigenvalue Magnitudes

- The magnitude of the $j$th-largest eigenvalue, $\lambda_j$, tells how much of the variability in the data is captured by the $j$th principal component.
- When the eigenvalues are sorted in decreasing order, the proportion of the variance captured by the first $d$ components is:
  $$\frac{\lambda_1 + \dots + \lambda_d}{\lambda_1 + \dots + \lambda_d + \lambda_{d+1} + \dots + \lambda_p}$$
- So if a "big" drop occurs in the eigenvalues at some point, that suggests a good dimension cutoff.

## Example: $\lambda_1=0.0938, \lambda_2=0.0007$

## Example: $\lambda_1=0.1260, \lambda_2=0.0054$

## Example: $\lambda_1=0.0884, \lambda_2=0.0725$

## Example: $\lambda_1=0.0881, \lambda_2=0.0769$

## More remarks

- Outliers have a big effect on the covariance matrix, so they can affect the eigenvectors quite a bit.
- A simple examination of the pairwise distances between instances can help discard points that are very far away (for the purpose of PCA).
- If the variances of the original dimensions vary considerably, they can "muddle" the true directions. Typically we normalize each dimension prior to PCA.
- In certain cases the eigenvectors are meaningful; e.g. in vision, they can be displayed as images ("eigenfaces").

## Uses of PCA

- Pre-processing for a supervised learning algorithm, e.g. for image data, robotic sensor data
- Used with great success in image and speech processing
- Visualization
- Exploratory data analysis
- Removing the linear component of a signal (before fancier non-linear models are applied)

## Visualization: Iris dataset

```{r}
library(GGally)
ggpairs(iris, aes(colour = Species, alpha=0.4), lower = list(combo = wrap("facethist", binwidth = 0.5)))
```

## Visualization: Iris dataset

```{r}
library(ggplot2)
pca <- prcomp(iris[,1:4], scale. = TRUE, retx = TRUE)
irispca <- cbind(iris, pca$x)
ggplot(data = irispca, aes(x = PC1, y = PC2, colour = Species)) + geom_point()
```

## Beyond PCA: Nonlinear dimensionality reduction

- Kernel PCA (but you don't get the eigenvectors)
- Self-Organizing Maps
- Isomap
- Locally Linear Embedding
-