Online demo of t-SNE visualization you can see here.
Machine learning algorithms have been put to good use in various areas for several years already. Analysis of various political events can become one of such areas. For instance, it can be used for predicting voting results, developing mechanisms for clustering the decisions made, analysis of political actors’ actions. In this article, I will try to describe the result of a research in this area.

Problem Definition  

Modern machine learning capabilities allow converting and visualizing huge amounts of data. Thereby it became possible to analyze political parties’ activities by converting voting instances that took place during 4 years into a self-organizing space of points that reflects actions of each elected official.

Each politician expressed themselves via 12 000 voting instances. Each voting instance can represent one of five possible actions (the person was absent, skipped the voting, voted approval, voted in negative, abstained).

The task is to convert the results of all voting instances into a point in the 3D Euclidean space that will reflect some considered attitude.

Open Data

The original data was taken from the official website and converted into intermediate data for a neural network.

Autoencoder

Considering the problem definition, it is necessary to represent 12 000 voting instances as a vector of the 2 or 3 dimension. Humans can operate 2- or 3-dimension spaces, and it is quite difficult to imagine more spaces.

Let’s apply autoencoder to decrease the capacity.

Simple Auto-encoder.

 

The autoencoder is based on two functions:

\(h = e\left(x \right)\)  – encoding function;

\(x’ = d(h)\) – decoding function;

The initial vector \(x\) with dimension \(m\) is supplied to the neural network as an input, and the network converts it into the value of the hidden layer \(h\) with dimension \(n\). After that the neural network decoder converts the value  of the hidden layer \(h\) into an output vector \(x\) with dimension \(m\), while \(m > n\).  That is, in the result the hidden layer \(h\) will be of lesser dimension, while being able to display all the range of the initial data.

Objective cost function is used for exercising the network:

\(L=(x, x’)=(x, d(e(x))\)

In other words, the difference between the values of the input and output layers is minimized. Exercised neural network allows compressing the dimension of the initial data to some dimension \(n\) on the hidden layer \(h\) .

On the figure, you can see one input layer, one hidden layer and one output layer. There can be more such layers in a real-case scenario.

Now we are finished with the theoretical part, let’s do some practice.

The data has been collected from the official site in the JSON format, and encoded into a vector already.

Input data encoding to vector.

Input data encoding to vector.

Now there is a dataset with dimension 24000 x 453. Let’s create a neural network using the TensorFlow means:

All project code available on GitHub page.

The network will be exercised by the RMSProb optimizer with learning rate 0.01. In the result, you can see the TensorFlow operation chart:

Autoencoder TensorFlow Graph.

Autoencoder TensorFlow Graph.

For extra testing purposes, let’s select the first four vectors and render their values as images on the neural network input and output. This way you can ensure that the values of the input and output layers are “identical” (to a tolerance).

Initial input data

Initial input data

Values of the neural network output layer

Values of the neural network output layer

Now let’s gradually pass all input data to the neural network and extract values of the hidden layer. These values are the compressed data in question. Besides, I tried to select different layers and chose the configuration that allowed coming around minimum error. Origin is the diagram of the benchmark exercising.

PCA and t-SNE dimension reducers

On this stage, you have 450 vectors with dimension 128. This result is quite good, but it is not good enough to give it away to a human. That’s why let’s go deeper. Let’s use the PCA and t-SNE approaches to lessen the dimension. There are many articles devoted to the principal component analysis method (PCA), so I won’t include any descriptions herein, however, I would like to tell you about the t-SNE approach. The initial document, Visualizing data using t-SNE, contains a detailed description of the algorithm; I will take reducing two-dimensional space to one-dimensional space as an example.

There is a 2D space and three classes (A, B, and C) located within this space. Let’s try to project the classes to one of the axes.

Classes

Projection classes.

As you can see, none of the axes is able to give us the broad picture of the initial classes. The classes get all mixed up, and, as a result, lose their initial characteristics. The task is to arrange the elements in the eventual space maintaining the distance ratio they had in the initial space. That is, the elements that were close to each other should remain closer than those located farther.

Stochastic Neighbor Embedding

Let’s convey the initial relation between the datapoints in the initial space as the distance between the points \(x_i\), \(x_j\) in Euclidean space: \(\mathopen|x_i – x_j\mathclose|\) and \(\mathopen| y_i – y_j \mathclose|\) correspondingly for the point in the space in question.

Let’s define conditional probabilities that represent similarities of points in the initial space:

\(p_{ij}=\frac{exp(- \mathopen||x_i – x_j\mathclose|| ^2 /2\sigma^2)}{ \sum_{k \neq l} exp(- \mathopen||x_k – x_l\mathclose|| ^2 /2\sigma^2)}\)

This expression shows how close the point \(x_j\) is to \(x_i\) providing that you define the distance to the nearest datapoints in the class as Gaussian distribution centered at \(x_i\) with the given variance \(\sigma\) (centered at point \(x_i\)).  Variance is unique for each datapoint and is determined separately based on the assumption that the points with higher density have lower variance.

Now let’s describe the similarity of datapoint  and datapoint  correspondingly in the new space:

\(q_{ij}=\frac{(1 + \mathopen||y_i – y_j\mathclose|| ^2)^{-1}}{ \sum_{k \neq l}(1 + \mathopen||y_k – y_l\mathclose|| ^2 )^{-1}}\)

Again, since we are only interested in modeling pairwise similarities, we set \(q_{ij} = 0\).

If the map points \(y_i\) and \(y_j\) correctly model the similarity between the high-dimensional datapoints \(x_i\) and \(x_j\), the conditional probabilities \(p_{ij}\) and \(q_{ij}\) will be equal. Motivated by this observation, SNE aims to find a low-dimensional data representation that minimizes the mismatch between \(p_{ij}\) and \(q_{ij}\) .

The algorithm finds the variance  for Gaussian distribution over each  datapoint \(x_i\). It is not likely that there is a single value of \(\sigma_i \) that is optimal for all datapoints in the data set because the density of the data is likely to vary. In dense regions, a smaller value of \(\sigma_i \) is usually more appropriate than in sparser regions.

SNE performs a binary search for the value of . The search is performed considering a measure of the effective number of neighbors (perplexity parameter) that will be taken into account when calculating .

The authors of this algorithm found an example in physics, and describe the algorithm as a set of objects with various springs that are capable of repelling and attracting other objects. If the system is not interfered with for some time, it will find a stationary point by balancing the strain of all springs.

t-Distributed Stochastic Neighbor Embedding

The difference between the SNE and t-SNE algorithm is that t-SNE uses a Student-t distribution (also known as t-Distribution, t-Student distribution) rather than a Gaussian, and a symmetrized version of the SNE cost function.

That is, at first the algorithm locates all initial objects in the lower-dimensional space. After that it moves object by object basing on the distance between them (which objects were located closer/farther) in the initial space.

TensorFlow, TensorBoard, and Projector.

There is no need to implement such algorithms yourself nowadays. You can use such ready-to-use mathematical packages as scikit, MATLAB, or TensorFlow.

In my previous article, I mentioned that the TensorFlow toolkit contains a package for data and exercising process visualization called TensorBoard. Let’s use this solution.

  • image
  • image
  • image
  • image
  • image
  • image
  • image
  • image
  • image

There is another way, an entire portal called projector that allows you to visualize your dataset directly on the Google server:

  1. Open the TensorBoard Projector website.
  2. Click Load Data.
  3. Select our dataset with vectors.
  4. Add the metadata prepared earlier: labels, classes, etc.
  5. Enable color map by one of the available columns.
  6. Optionally, add JSON *.config file and publish data for public view.

Now you can send the link to your analyst.

Those interested in the subject domain may find useful viewing various slices, for example:

  • Distribution of votes of politicians from different regions.
  • Voting accuracy of different parties.
  • Distribution of voting of politicians from one party.
  • Similarity of voting of politicians from different parties.  

Conclusions

  • Autoencoders represent a range of simple algorithms that give surprisingly quick and good convergence result.
  • Automatic clustering does not answer the question about the nature of the initial data and requires further analysis; however, it provides a quick and clear vector that allows you to start working with your data.
  • TensorFlow and TensorBoard are powerful and fast-evolving tools for machine learning that allow solving tasks of diverse complexity.

Author: Volodymyr Pavliukevych

Senior Software Engineer, Data Scientist.

Senior Software Engineer, Data Scientist.

Leave a Reply