Swivel uses stochastic gradient descent to perform a weighted approximate matrix factorization, ultimately arriving at embeddings whose dot products reconstruct the pointwise mutual information (PMI) between each row and column feature. Swivel uses a piecewise loss function that treats observed and unobserved co-occurrences differently: observed cells incur a squared error against their PMI, weighted by the co-occurrence count, while unobserved cells incur a softer penalty against a smoothed PMI estimate.
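A minimal NumPy sketch of such a piecewise loss, following the formulation in the Swivel paper (Shazeer et al., 2016): the function name swivel_cell_loss and the sqrt(count) weighting are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np

def swivel_cell_loss(pred, pmi, count):
    """Piecewise Swivel-style loss for one cell of the co-occurrence matrix.

    pred:  model's estimate, the dot product of the row and column embeddings
    pmi:   PMI of the cell; for unobserved cells, a smoothed PMI computed
           as if the count were 1 (an optimistic upper bound)
    count: raw co-occurrence count of the cell
    """
    if count > 0:
        # Observed: squared error, weighted so frequent (more reliable)
        # pairs contribute more. sqrt(count) is one monotone choice of
        # weighting, assumed here for illustration.
        return 0.5 * np.sqrt(count) * (pred - pmi) ** 2
    # Unobserved: a "soft hinge" log(1 + exp(pred - pmi)) that penalizes
    # the model for overshooting the smoothed PMI bound; logaddexp is the
    # numerically stable form.
    return np.logaddexp(0.0, pred - pmi)

# Observed cell: prediction 0.9 vs. true PMI 1.3 at count 12
print(swivel_cell_loss(0.9, 1.3, 12))
# Unobserved cell: predicting well above the smoothed bound is penalized
print(swivel_cell_loss(2.0, 0.4, 0))
```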
"Tomas Mikolov even told me that the whole idea behind word2vec was to demonstrate that you can get better word representations if you trade the model's complexity for efficiency, i.e. the ability to learn from much bigger datasets." (Omer Levy, on Quora)
Questions?