AI Engineering Academy · Lesson

Clustering and Visualizing Embeddings

Apply k-means clustering to a set of embeddings and visualize them in 2D using UMAP to discover natural topic groupings in your data.

Why Cluster Embeddings?

When you have hundreds or thousands of documents, you often want to discover what topics exist without manually reading everything. Clustering embeddings groups semantically similar documents together automatically, revealing the natural structure of your data.

Common applications include auto-tagging support tickets, discovering content categories, finding redundant documents, and understanding what users ask about most.

K-Means Clustering Overview

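K-means partitions n data points into k clusters by iteratively assigning each point to the nearest centroid, then recomputing each centroid as the mean of its assigned points. The algorithm stops when assignments no longer change. The result is a local optimum that depends on the initial centroids, so different starting points can produce different clusters.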
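
The assign-then-recompute loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name lloyd_kmeans, the plain random initialization, and the toy data are all assumptions made for the example.

```python
import numpy as np


def lloyd_kmeans(points, k, n_iters=100, seed=0):
    """Plain k-means loop: assign to nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (a simple scheme,
    # not the k-means++ initialization that scikit-learn uses by default).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)

    for _ in range(n_iters):
        # Assignment step: squared Euclidean distance from every point
        # to every centroid, then pick the nearest centroid per point.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)

        # Converged: no point changed cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)

    return labels, centroids


# Toy data: two tight, well-separated blobs in 2D.
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
toy_labels, toy_centroids = lloyd_kmeans(toy, k=2)
print(toy_labels)
print(toy_centroids)
```

On data this cleanly separated, the two blobs should end up in different clusters after a handful of iterations. Real embeddings are rarely this tidy, which is one reason library implementations add smarter initialization and multiple restarts.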

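For embeddings, k-means finds documents that are close together, by Euclidean distance, in the high-dimensional embedding space, effectively grouping them by semantic similarity. If the embeddings are normalized to unit length, Euclidean distance ranks neighbors the same way cosine similarity does.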

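import numpy as np
from sklearn.cluster import KMeans

# corpus_embeddings: (n_docs, 1536), pre-computed by your embedding model.
# Random placeholder data here, so the resulting clusters carry no meaning.
rng = np.random.default_rng(42)
corpus_embeddings = rng.standard_normal((200, 1536))

# n_init='auto' requires scikit-learn 1.2 or newer.
# random_state makes the run reproducible.
kmeans = KMeans(n_clusters=5, random_state=42, n_init='auto')
kmeans.fit(corpus_embeddings)

# labels[i] is the cluster index (0 to 4) assigned to document i.
labels = kmeans.labels_
print(f'Cluster assignments: {labels[:10]}')
print(f'Unique clusters: {np.unique(labels)}')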
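
Once labels are available, a natural next step is to check how large each cluster is and which documents sit closest to each centroid, since those tend to be the most typical members. The sketch below is one way to do that. It assumes a hypothetical documents list aligned row by row with the embeddings, and it uses random placeholder vectors (64 dimensions to keep it fast), so the output is illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data standing in for real embeddings and their source texts.
rng = np.random.default_rng(0)
corpus_embeddings = rng.normal(size=(200, 64))
# Hypothetical texts, one per embedding row.
documents = [f"document {i}" for i in range(len(corpus_embeddings))]

kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
labels = kmeans.fit_predict(corpus_embeddings)

# How many documents landed in each cluster.
cluster_sizes = np.bincount(labels, minlength=kmeans.n_clusters)

# transform() gives each document's distance to every centroid: (n_docs, k).
distances = kmeans.transform(corpus_embeddings)

representatives = {}
for cluster_id in range(kmeans.n_clusters):
    member_idx = np.where(labels == cluster_id)[0]
    # Sort members by distance to their own centroid, keep the closest three.
    order = np.argsort(distances[member_idx, cluster_id])[:3]
    closest = member_idx[order]
    representatives[cluster_id] = [documents[i] for i in closest]
    print(cluster_id, cluster_sizes[cluster_id], representatives[cluster_id])
```

With real data, reading the few documents nearest each centroid is often enough to give a cluster a working name, such as "billing questions" or "login problems".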
