[1412.5949] Large Scale Distributed Distance Metric Learning
In this paper, we reformulate the original DML problem into an unconstrained optimization problem that is amenable to parallelization, and use it as the basis for a distributed framework that supports large-scale distance metric learning.

Abstract: In large scale machine learning and data mining problems with high feature
dimensionality, the Euclidean distance between data points can be
uninformative, and Distance Metric Learning (DML) is often desired to learn a
proper similarity measure (using side information such as example data pairs
being similar or dissimilar). However, high dimensionality and large volume of
pairwise constraints in modern big data can lead to prohibitive computational
cost for both the original DML formulation in Xing et al. (2002) and later
extensions. In this paper, we present a distributed algorithm for DML, and a
large-scale implementation on a parameter server architecture. Our approach
builds on a parallelizable reformulation of Xing et al. (2002), and an
asynchronous stochastic gradient descent optimization procedure. To our
knowledge, this is the first distributed solution to DML, and we show that, on
a system with 256 CPU cores, our program is able to complete a DML task on a
dataset with 1 million data points, 22-thousand features, and 200 million
labeled data pairs, in 15 hours; and the learned metric is highly effective at
properly measuring distances.

Figure 1: Logical view of the parameter server. A centralized server maintains the global parameter L. Each worker p holds a local copy Lp of the global parameter. In each iteration, each worker takes a data pair, computes a gradient update ΔLp, and sends ΔLp to the central server. The server aggregates the gradients received from the workers and uses them to update the global parameter L. L is then pushed back to each worker, and the workers replace their local copies with L. (Parameter Synchronization)

Figure 2: Convergence curves on (a) the MNIST dataset, (b) the ImageNet-63K dataset, and (c) the ImageNet-1M dataset. (Scalability of Our Framework)

Figure 3: Speedup on (a) the MNIST dataset, (b) the ImageNet-63K dataset, and (c) the ImageNet-1M dataset. (Scalability of Our Framework)

Figure 4: (a) Average precision versus running time on the MNIST dataset. (b) Precision-recall curves on the MNIST dataset. (c) Precision-recall curves on the ImageNet-1M dataset. (Quality of the Learned Distance Metric)
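The worker/server loop described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal, single-process simulation, not the paper's implementation: the pairwise hinge-style loss, the margin, the learning rate, and the toy data are all illustrative assumptions, and the asynchronous updates are approximated here by a synchronous aggregation step on the "server".

```python
import numpy as np

rng = np.random.default_rng(0)
dim, margin, lr, workers, iters = 5, 4.0, 0.02, 4, 200

# Global parameter held by the server; the metric is d(x, y) = ||L (x - y)||^2.
L = np.eye(dim)

def pair_gradient(L, x, y, similar):
    """Gradient of an illustrative pairwise loss w.r.t. L (assumed, not the paper's exact loss):
    similar pairs are pulled together; dissimilar pairs inside the margin are pushed apart."""
    d = x - y
    Ld = L @ d
    if similar:
        return 2.0 * np.outer(Ld, d)
    if Ld @ Ld < margin:
        return -2.0 * np.outer(Ld, d)
    return np.zeros_like(L)

# Toy side information: similar pairs differ by small noise, dissimilar pairs are independent.
def draw_pair(similar):
    x = rng.normal(size=dim)
    y = x + 0.3 * rng.normal(size=dim) if similar else rng.normal(size=dim)
    return x, y, similar

pairs = [draw_pair(i % 2 == 0) for i in range(400)]
sim0 = np.mean([np.sum((x - y) ** 2) for x, y, s in pairs if s])  # distance under L = I

for t in range(iters):
    # Each worker takes a data pair and computes a gradient on its local copy of L.
    grads = [pair_gradient(L, *pairs[(t * workers + w) % len(pairs)])
             for w in range(workers)]
    # The server aggregates the workers' gradients and updates the global L,
    # which is then pushed back to the workers (here: the next loop iteration).
    L = L - lr * sum(grads) / workers

sim_d = np.mean([np.sum((L @ (x - y)) ** 2) for x, y, s in pairs if s])
dis_d = np.mean([np.sum((L @ (x - y)) ** 2) for x, y, s in pairs if not s])
```

After training, the average distance between similar pairs under the learned L drops below its value under the identity metric, while dissimilar pairs inside the margin are kept apart by the hinge term.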