Visualizing High Dimensional Image Clusters in 2D: The Growing Entourage Plot (Part II)

Damon Crockett

continued from Part I

Architecture Crop
Growing Entourage with 50 clusters of Instagram photos machine-tagged under the heading 'architecture'. Cropped.

Architecture Close
Closeup of plot immediately above.

Within each cluster, every image is ranked by its Euclidean distance (in the original feature space) from the cluster centroid. We can think of the centroid as the 'leader' of an 'entourage', and of each image in the cluster as a member of that entourage. The higher a member's rank, the closer it gets to 'stand' to its leader. Clusters take turns adding members of their entourages, starting with those closest to the leader, and each added member stands in the open grid square nearest its leader. This rule also settles local conflicts between entourages: because an added member must occupy an open grid square, a square already claimed by one entourage is unavailable to its neighbors.
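The placement rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind the plots: the function names (`rank_members`, `nearest_open_cell`, `grow_entourages`), the square grid of side `grid_side`, and the tie-breaking order are all assumptions made for the example. It also assumes the leader's own grid square starts out open, so the top-ranked member takes it.

```python
import math


def rank_members(features, centroid):
    """Return member indices sorted by Euclidean distance to the centroid, nearest first."""
    return sorted(range(len(features)), key=lambda i: math.dist(features[i], centroid))


def nearest_open_cell(leader, occupied, grid_side):
    """Return the unoccupied grid cell closest to the leader's cell, or None if the grid is full."""
    best = None
    for x in range(grid_side):
        for y in range(grid_side):
            if (x, y) in occupied:
                continue
            d = (x - leader[0]) ** 2 + (y - leader[1]) ** 2
            # Strict comparison: ties go to the first cell found in scan order (an assumption).
            if best is None or d < best[0]:
                best = (d, (x, y))
    return None if best is None else best[1]


def grow_entourages(leader_cells, ranked_members, grid_side):
    """Place members on the grid, one per cluster per round.

    leader_cells:   one (x, y) grid cell per cluster (the projected centroid).
    ranked_members: one list of image ids per cluster, nearest to the centroid first.
    Returns a dict mapping image id -> (x, y) grid cell.
    """
    placement = {}
    occupied = set()  # assumption: leader cells start open
    queues = [list(members) for members in ranked_members]
    while any(queues):
        for cluster, queue in enumerate(queues):
            if not queue:
                continue  # this entourage is fully placed
            cell = nearest_open_cell(leader_cells[cluster], occupied, grid_side)
            if cell is None:
                return placement  # grid is full; remaining members are not drawn
            image_id = queue.pop(0)
            placement[image_id] = cell
            occupied.add(cell)
    return placement


# Two clusters of two images each on a 5x5 grid.
placement = grow_entourages([(1, 1), (3, 3)], [["a0", "a1"], ["b0", "b1"]], 5)
```

Scanning the whole grid for every placement is slow on large canvases, but it keeps the rule easy to read: one member per cluster per round, always in the nearest square that is still open.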

50 image clusters ('entourages'), growing around their centroids, projected to 2D by PCA.

Because members compete for open grid squares, the look of the plot depends on how we generate the original grid. With a wide grid, we might end up with an array of circular clusters in 2D. With a tighter grid, we might end up with one large clump of images, with high-ranking members bunched up around their leaders and lower-ranking members scattered in nearby territories.
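To see why the grid matters, consider one plausible way of turning projected centroid coordinates into grid squares. The sketch below assumes the 2D coordinates are already in hand (for example, from PCA) and uses simple min-max rescaling. Both `snap_to_grid` and its `grid_side` parameter are hypothetical, introduced only for illustration, and the rescaling scheme is an assumption rather than a description of the code behind the plots.

```python
def snap_to_grid(points_2d, grid_side):
    """Rescale 2D points to the range [0, grid_side - 1] and round to integer grid cells."""
    xs = [p[0] for p in points_2d]
    ys = [p[1] for p in points_2d]

    def scale(value, lo, hi):
        if hi == lo:
            return 0  # all points share this coordinate
        return round((value - lo) / (hi - lo) * (grid_side - 1))

    return [
        (scale(x, min(xs), max(xs)), scale(y, min(ys), max(ys)))
        for x, y in points_2d
    ]


# Three hypothetical projected centroids.
centroids_2d = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]

tight = snap_to_grid(centroids_2d, 5)   # leaders only a few squares apart
wide = snap_to_grid(centroids_2d, 21)   # same layout, far more squares between leaders
```

The relative layout of the leaders is the same in both cases, but on the wider grid each entourage has many more open squares to fill before it runs into a neighbor.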

Activities (wide)
Growing Entourage with wide grid, resulting in relatively isolated circular clusters.

Growing Entourage using same data as above, but with a tighter grid. Some clusters are isolated, some have clumped with neighbors.

This is not, of course, the only way to present clusters on a 2D canvas. It is, however, probably the best way to preserve as much of the complexity of intercluster relations as possible in 2D. It also preserves similarity relations among images in the original feature space, something we lose with pure projection methods. Finally, it preserves intracluster relations by giving the semantically closest entourage members the privileged locations nearest their leaders.

The plotting algorithm is written in Python, using the Python Imaging Library (and scikit-learn for projection), and the basic code is here.