A few years ago, ML algorithms looked strange and difficult to an average software engineer. But ML is growing fast, and nowadays it is easy to improve a production solution with some artificial intelligence. You don’t need a twenty-person data science department to extend your service with smart analytics or AI.

I will show you how to add smart search to your service.

Currently, our service is a place where each user can share their articles, documents, videos, calendar events, tasks, etc. So we have a huge database of user content. The problem is that it is hard for a user to find a certain document or event. All items have tags and full-text search. But what about video and audio files?

The usual creation use case is:

1) A user adds a YouTube link or uploads a file.

2) A user names the new item: ‘New movie 01’.

And that is all. No one wants to spend time improving their own content.

In the real world, we can help our customers. First, we can ask YouTube for the details of a particular video to improve future search. But unfortunately, we usually deal with the same lazy user, or with a video file uploaded from a PC, which has no YouTube metadata at all.

So let’s make the AI search for tags and full text in users’ content!

What do we need? As I said earlier, nowadays this is much easier than it used to be.

Task:

1) Download youtube video;

2) Extract audio track;

3) Perform Speech to Text;

4) Search for keywords – they will be our tags;

5) Search for the most relevant sentence for full text search;

We will do all of this in Python: it will reduce development time and simplify integration with ML frameworks.
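Sketched end to end, the glue code could look like this. Every helper below is a hypothetical placeholder for one of the steps above (the real versions use youtube-dl, ffmpeg, and DeepSpeech), so all names here are illustrative only:

```python
# A rough sketch of the pipeline; each helper is a stand-in for a real step.

def download_video(url):
    # Placeholder: fetch the video with youtube-dl and return a local path.
    return url.rsplit('=', 1)[-1] + '.mp4'

def extract_audio(video_path):
    # Placeholder: extract a mono 16 kHz wav track with ffmpeg.
    return video_path.rsplit('.', 1)[0] + '.wav'

def speech_to_text(audio_path):
    # Placeholder: run DeepSpeech over the wav file.
    return 'this is the recognized transcript'

def extract_tags_and_sentence(text):
    # Placeholder: TextRank-style keyword and key-sentence extraction.
    return text.split()[:3], text

def process(url):
    audio_path = extract_audio(download_video(url))
    text = speech_to_text(audio_path)
    tags, sentence = extract_tags_and_sentence(text)
    return tags, sentence

tags, sentence = process('https://www.youtube.com/watch?v=example')
```

The real implementations of each step follow below.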

As I said above, there are a lot of ML frameworks ready to go. One of them is Mozilla DeepSpeech.

In order to start, we need to download the latest model from GitHub and install the pip modules.

```shell
mkdir /Users/Volodymyr/Projects/deepspeech/
cd /Users/Volodymyr/Projects/deepspeech/
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.3.0/deepspeech-0.3.0-models.tar.gz
tar zxvf deepspeech-0.3.0-models.tar.gz
pip3 install wave numpy tensorflow youtube_dl ffmpeg-python deepspeech nltk networkx
brew install ffmpeg wget
```

DeepSpeech is based on TensorFlow, so it can be easily used both on CPU and GPU.

The next step is to download and process a video file from YouTube. We can use the youtube-dl library for that.

After that we can extract audio track and resample it to the acceptable format.

```python
import ffmpeg

_ = ffmpeg.input(youtube_id + '.wav') \
    .output(output_file_name, ac=1, t=crop_time, ar='16k') \
    .overwrite_output() \
    .run(capture_stdout=False)
```

The wave library will help us represent our file as a numpy array before DeepSpeech begins its work.

```python
import sys
import wave

import numpy as np

# This snippet lives inside a helper function, hence the early return.
fin = wave.open(file_name, 'rb')
framerate_sample = fin.getframerate()
if framerate_sample != 16000:
    print('Warning: original sample rate ({}) is different than 16kHz. '
          'Resampling might produce erratic speech recognition.'.format(framerate_sample),
          file=sys.stderr)
    fin.close()
    return
else:
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
    audio_length = fin.getnframes() * (1 / 16000)
    fin.close()
```

Step number two. Searching for meaning.

We need to find a few keywords and relevant sentences in the text. To make this possible, we will use graph-based ranking algorithms. The basic idea implemented by a graph-based ranking model is that of “voting” or “recommendation”. When one vertex links to another one, it is basically casting a vote for that other vertex. The higher the number of votes that are cast for a vertex, the higher the importance of the vertex. Moreover, the importance of the vertex casting the vote determines how important the vote itself is, and this information is also taken into account by the ranking model. Hence, the score associated with a vertex is determined based on the votes that are cast for it, and the score of the vertices casting these votes. The score of a vertex is defined as follows:
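Reconstructing the formula from the TextRank paper:

\(S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|}\)

where \(In(V_i)\) is the set of vertices that point to \(V_i\), and \(Out(V_j)\) is the set of vertices that \(V_j\) points to.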

Where d is a damping factor that can be set between 0 and 1, which has the role of integrating into the model the probability of jumping from a given vertex to another random vertex in the graph.

However, in our model the graphs are built from natural language texts, and may include multiple or partial links between the units (vertices) that are extracted from text. It may therefore be useful to indicate and incorporate into the model the “strength” of the connection between two vertices \(V_i\) and \(V_j\) as a weight \(w_{ij}\) added to the corresponding edge that connects the two vertices.

The task of a keyword extraction application is to automatically identify in a text a set of terms that best describe the document. Such keywords may constitute useful entries for building an automatic index for a document collection, can be used to classify a text, or may serve as a concise summary for a given document. Moreover, a system for automatic identification of important terms in a text can be used for the problem of terminology extraction, and construction of domain-specific dictionaries.

The simplest possible approach is perhaps to use a frequency criterion to select the “important” keywords in a document. However, this method was generally found to lead to poor results, and consequently other methods were explored. The state of the art in this area is currently represented by supervised learning methods, where a system is trained to recognize keywords in a text, based on lexical and syntactic features. The expected end result for this application is a set of words or phrases that are representative for a given natural language text. The units to be ranked are therefore sequences of one or more lexical units extracted from text, and these represent the vertices that are added to the text graph. Any relation that can be defined between two lexical units is a potentially useful connection (edge) that can be added between two such vertices. We are using a co-occurrence relation, controlled by the distance between word occurrences: two vertices are connected if their corresponding lexical units co-occur within a window of maximum N words, where N can be set anywhere from 2 to 10 words.

The vertices added to the graph can be restricted with syntactic filters, which select only lexical units of a certain part of speech. One can for instance consider only nouns and verbs for addition to the graph, and consequently draw potential edges based only on relations that can be established between nouns and verbs. We experimented with various syntactic filters, including: all open class words, nouns and verbs only, etc., with best results observed for nouns and adjectives only. The TextRank keyword extraction algorithm is fully unsupervised, and proceeds as follows. First, the text is tokenized, and annotated with part of speech tags – a preprocessing step required to enable the application of syntactic filters. To avoid excessive growth of the graph size by adding all possible combinations of sequences consisting of more than one lexical unit (ngrams), we consider only single words as candidates for addition to the graph, with multi-word keywords being eventually reconstructed in the post-processing phase. Next, all lexical units that pass the syntactic filter are added to the graph, and an edge is added between those lexical units that co-occur within a window of words. After the graph is constructed (an undirected, unweighted graph), the score associated with each vertex is set to an initial value of 1, and the ranking algorithm described above is run on the graph for several iterations until it converges – usually for 20-30 iterations, at a threshold of 0.0001.
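To make the procedure concrete, here is a minimal pure-Python sketch of this ranking step. Tokenization and part-of-speech filtering are assumed to have already been done (e.g. with nltk), and the function and parameter names are my own:

```python
from collections import defaultdict

def textrank_keywords(words, window=2, d=0.85, iterations=30, top=5):
    """Minimal unweighted TextRank over a word co-occurrence graph."""
    # Build an undirected graph: words co-occurring within `window` are linked.
    neighbors = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                neighbors[words[i]].add(words[j])
                neighbors[words[j]].add(words[i])
    # Every vertex starts with a score of 1; iterate the ranking formula.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {w: (1 - d) + d * sum(scores[v] / len(neighbors[v])
                                       for v in neighbors[w])
                  for w in neighbors}
    return sorted(scores, key=scores.get, reverse=True)[:top]

top_words = textrank_keywords('machine learning makes machine search smarter'.split(), top=2)
```

On a toy token list like this one, the word with the most co-occurrence links ends up ranked first.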

In a way, the problem of sentence extraction can be regarded as similar to keyword extraction, since both applications aim at identifying sequences that are more “representative” for the given text. In keyword extraction, the candidate text units consist of words or phrases, whereas in sentence extraction, we deal with entire sentences. The algorithm turns out to be well suited for this type of application, since it allows for a ranking over text units that is recursively computed based on information drawn from the entire text.

To apply our service, we first need to build a graph associated with the text, where the graph vertices are representative for the units to be ranked. For the task of sentence extraction, the goal is to rank entire sentences, and therefore a vertex is added to the graph for each sentence in the text.

The co-occurrence relation used for keyword extraction cannot be applied here, since the text units in consideration are significantly larger than one or few words, and “co-occurrence” is not a meaningful relation for such large contexts. Instead, we are defining a different relation, which determines a connection between two sentences if there is a “similarity” relation between them, where “similarity” is measured as a function of their content overlap. Such a relation between two sentences can be seen as a process of “recommendation”: a sentence that addresses certain concepts in a text, gives the reader a “recommendation” to refer to other sentences in the text that address the same concepts, and therefore a link can be drawn between any two such sentences that share common content.

The overlap of two sentences can be determined simply as the number of common tokens between the lexical representations of the two sentences, or it can be run through syntactic filters, which only count words of a certain syntactic category, e.g. all open class words, nouns and verbs, etc. Moreover, to avoid promoting long sentences, we are using a normalization factor, and divide the content overlap of two sentences with the length of each sentence.
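In the TextRank paper, this normalized overlap between two sentences is defined as:

\(Sim(S_i, S_j) = \frac{\left|\{w_k \mid w_k \in S_i \wedge w_k \in S_j\}\right|}{\log(|S_i|) + \log(|S_j|)}\)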

In the end we will have a list of keywords to use as tags, and one selected sentence for the full-text search approach.

All code is available on GitHub. Good luck!


The Flow Inspector is a tool that can help you review the TensorFlow graph in your Swift program.

The video below shows the default layout of the Flow Inspector debugger and main interaction process.

The Four Parts of Debugging and the Debugging Tools

There are four parts to the debugging workflow:

- File Navigator – where you can select bin and source files.
- Source section to review your code and select a certain function to inspect.
- Graph section to review your graph inside Flow Inspector.
- Console output section to review output and errors in your program.

Flow Inspector Alpha version is available on GitHub.

Official documentation describes the compilation process:

Once the tensor operations are desugared, a transformation we call “partitioning” extracts the graph operations from the program and builds a new SIL function to represent the tensor code. In addition to removing the tensor operations from the host code, new calls are injected that call into our new runtime library to start up TensorFlow, rendezvous to collect any results, and send/receive values between the host and the tensor program as it runs. The bulk of the Graph Program Extraction transformation itself lives in TFPartition.cpp.

Once the tensor function is formed, it has some transformations applied to it, and is eventually emitted to a TensorFlow graph using the code in TFLowerGraph.cpp. After the TensorFlow graph is formed, we serialize it to a protobuf and encode the bits directly into the executable, making it easy to load at program runtime.

Actually the final graph is serialized into protobuf bytes and copied directly into the executable file.

I made a small debugging tool, Flow Inspector, which can help with that problem.

You can find package template and readme on my GitHub page.

There are some interesting points:

1) High level APIs will be presented as a separate SwiftPM package under github.com/tensorflow.

High level APIs were added earlier purely to explore the programming model, not to be usable by anyone. Having high level APIs be part of the stdlib module conveys a wrong message for beta testers, and it has been confusing ever since our open source release.

2) Supporting Python code is one of the priorities:

- Improved Python diagnostics related to member access.
- Improved Python C API functions for binary arithmetic operations.

3) Improved cross-device sends and receives support.

4) Lots of work done around supporting generic @dynamicCallable methods.

5) Deprecated `a.dot(b)` and `⊗` in favor of `matmul(a, b)`.

The Google Brain team launched a new project, ‘Swift for TensorFlow’.

Swift for TensorFlow is a new way to develop machine learning models. It gives you the power of TensorFlow directly integrated into the Swift programming language.

That means that for the next few months I will work on Kraken’s new API. Join the community and follow the updates.

You can see an online demo of the t-SNE visualization here.

Machine learning algorithms have been put to good use in various areas for several years already. Analysis of various political events is one such area: machine learning can be used for predicting voting results, developing mechanisms for clustering the decisions made, and analyzing the actions of political actors. In this article, I will try to describe the results of research in this area.

Modern machine learning capabilities allow converting and visualizing huge amounts of data. Thereby it became possible to analyze political parties’ activities by converting voting instances that took place during 4 years into a self-organizing space of points that reflects actions of each elected official.

Each politician expressed themselves via 12,000 voting instances. Each voting instance can represent one of five possible actions (the person was absent, skipped the voting, voted in favor, voted against, or abstained).

The task is to convert the results of all voting instances into a point in the 3D Euclidean space that will reflect some considered attitude.

The original data was taken from the official website and converted into intermediate data for a neural network.
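For illustration, each action can be one-hot encoded, five values per voting instance; the action names and their order below are my own assumption, not the actual encoding used:

```python
# Hypothetical encoding of voting actions; names and order are illustrative.
ACTIONS = ['absent', 'skipped', 'for', 'against', 'abstained']

def encode_votes(votes):
    """Turn a list of per-voting actions into one flat input vector."""
    vector = []
    for action in votes:
        one_hot = [0.0] * len(ACTIONS)
        one_hot[ACTIONS.index(action)] = 1.0
        vector.extend(one_hot)
    return vector

sample = encode_votes(['for', 'absent'])
```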

Considering the problem definition, it is necessary to represent 12,000 voting instances as a vector of dimension 2 or 3. Humans can operate in 2- or 3-dimensional spaces; it is quite difficult to imagine higher-dimensional ones.

Let’s apply an autoencoder to reduce the dimensionality.

The autoencoder is based on two functions:

\(h = e\left(x \right)\) – encoding function;

\(x’ = d(h)\) – decoding function;

The initial vector \(x\) with dimension \(m\) is supplied to the neural network as an input, and the network converts it into the value of the hidden layer \(h\) with dimension \(n\). After that, the decoder part of the network converts the value of the hidden layer \(h\) into an output vector \(x'\) with dimension \(m\), where \(m > n\). That is, the hidden layer \(h\) has a smaller dimension while still being able to represent the whole range of the initial data.

An objective cost function is used for training the network:

\(L(x, x') = L(x, d(e(x)))\)

In other words, the difference between the values of the input and output layers is minimized. The trained neural network allows compressing the initial data to some dimension \(n\) on the hidden layer \(h\).

On the figure, you can see one input layer, one hidden layer and one output layer. There can be more such layers in a real-case scenario.

Now that we are finished with the theoretical part, let’s do some practice.

The data has been collected from the official site in the JSON format, and encoded into a vector already.

Now there is a dataset with dimension 24000 x 453. Let’s create a neural network using TensorFlow:

```python
# num_input, num_hidden_1..3 and the placeholder X are defined elsewhere in the project.

# Building the encoder
def encoder(x):
    with tf.variable_scope('encoder', reuse=False):
        with tf.variable_scope('layer_1', reuse=False):
            w1 = tf.Variable(tf.random_normal([num_input, num_hidden_1]), name="w1")
            b1 = tf.Variable(tf.random_normal([num_hidden_1]), name="b1")
            layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, w1), b1))
        with tf.variable_scope('layer_2', reuse=False):
            w2 = tf.Variable(tf.random_normal([num_hidden_1, num_hidden_2]), name="w2")
            b2 = tf.Variable(tf.random_normal([num_hidden_2]), name="b2")
            layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, w2), b2))
        with tf.variable_scope('layer_3', reuse=False):
            w3 = tf.Variable(tf.random_normal([num_hidden_2, num_hidden_3]), name="w3")
            b3 = tf.Variable(tf.random_normal([num_hidden_3]), name="b3")
            layer_3 = tf.nn.sigmoid(tf.add(tf.matmul(layer_2, w3), b3))
        return layer_3

# Building the decoder
def decoder(x):
    with tf.variable_scope('decoder', reuse=False):
        with tf.variable_scope('layer_1', reuse=False):
            w1 = tf.Variable(tf.random_normal([num_hidden_3, num_hidden_2]), name="w1")
            b1 = tf.Variable(tf.random_normal([num_hidden_2]), name="b1")
            layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, w1), b1))
        with tf.variable_scope('layer_2', reuse=False):
            w2 = tf.Variable(tf.random_normal([num_hidden_2, num_hidden_1]), name="w2")
            b2 = tf.Variable(tf.random_normal([num_hidden_1]), name="b2")
            layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, w2), b2))
        with tf.variable_scope('layer_3', reuse=False):
            w3 = tf.Variable(tf.random_normal([num_hidden_1, num_input]), name="w3")
            b3 = tf.Variable(tf.random_normal([num_input]), name="b3")
            layer_3 = tf.nn.sigmoid(tf.add(tf.matmul(layer_2, w3), b3))
        return layer_3

# Construct the model
encoder_op = encoder(X)
decoder_op = decoder(encoder_op)

# Prediction
y_pred = decoder_op
# Targets (labels) are the input data.
y_true = X
```

All of the project code is available on the GitHub page.

The network will be trained by the RMSProp optimizer with a learning rate of 0.01. As a result, you can see the TensorFlow operation graph:

For extra testing purposes, let’s take the first four vectors and render their values as images at the neural network’s input and output. This way we can make sure the values of the input and output layers are “identical” (to a tolerance).

Now let’s gradually pass all the input data through the neural network and extract the values of the hidden layer. These values are the compressed data in question. I also tried different layer configurations and chose the one that gave the minimum error. Below is the chart of the training benchmark.

At this stage, we have 450 vectors of dimension 128. This result is quite good, but it is not good enough to hand over to a human. That’s why we’ll go deeper and use the PCA and t-SNE approaches to reduce the dimensionality further. There are many articles devoted to the principal component analysis method (*PCA*), so I won’t include any descriptions herein; however, I would like to tell you about the t-SNE approach. The original paper, **Visualizing Data using t-SNE**, contains a detailed description of the algorithm; I will take reducing a two-dimensional space to a one-dimensional space as an example.

There is a 2D space and three classes (A, B, and C) located within this space. Let’s try to project the classes to one of the axes.

As you can see, none of the axes is able to give us the broad picture of the initial classes. The classes get all mixed up, and, as a result, lose their initial characteristics. The task is to arrange the elements in the eventual space maintaining the distance ratio they had in the initial space. That is, the elements that were close to each other should remain closer than those located farther.

Let’s express the initial relation between datapoints as the distance between the points \(x_i\), \(x_j\) in Euclidean space, \(\|x_i - x_j\|\), and as \(\|y_i - y_j\|\) correspondingly for the points in the target space.

Let’s define conditional probabilities that represent similarities of points in the initial space:

\(p_{ij}=\frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_k - x_l\|^2 / 2\sigma^2)}\)

This expression shows how close the point \(x_j\) is to \(x_i\), provided that the distance to the nearby datapoints is modeled as a Gaussian distribution centered at \(x_i\) with a given variance \(\sigma\). The variance is unique for each datapoint and is determined separately, based on the assumption that points in denser regions have lower variance.

Now let’s describe the similarity of the corresponding datapoints in the new space:

\(q_{ij}=\frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l}(1 + \|y_k - y_l\|^2)^{-1}}\)

Again, since we are only interested in modeling pairwise similarities, we set \(q_{ii} = 0\).

If the map points \(y_i\) and \(y_j\) correctly model the similarity between the high-dimensional datapoints \(x_i\) and \(x_j\), the conditional probabilities \(p_{ij}\) and \(q_{ij}\) will be equal. Motivated by this observation, SNE aims to find a low-dimensional data representation that minimizes the mismatch between \(p_{ij}\) and \(q_{ij}\) .
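Concretely, the mismatch in the t-SNE paper is measured with the Kullback-Leibler divergence between the two distributions:

\(C = KL(P \parallel Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}\)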

The algorithm finds the variance of the Gaussian distribution centered over each datapoint \(x_i\). It is not likely that there is a single value of \(\sigma_i\) that is optimal for all datapoints in the data set, because the density of the data is likely to vary. In dense regions, a smaller value of \(\sigma_i\) is usually more appropriate than in sparser regions.

SNE performs a binary search for the value of \(\sigma_i\). The search is performed considering a measure of the effective number of neighbors (the perplexity parameter) that will be taken into account when calculating \(\sigma_i\).
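For reference, the paper defines perplexity over the conditional probabilities \(p_{j|i}\) as:

\(Perp(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}\)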

The authors of this algorithm use an analogy from physics, describing the algorithm as a set of objects connected by various springs that are capable of repelling and attracting other objects. If the system is left alone for some time, it finds a stationary point by balancing the strain of all the springs.

The difference between the SNE and t-SNE algorithm is that t-SNE uses a Student-t distribution (also known as t-Distribution, t-Student distribution) rather than a Gaussian, and a symmetrized version of the SNE cost function.

That is, at first the algorithm places all the initial objects in the lower-dimensional space. After that it moves them object by object, based on the distances between them (which objects were closer or farther) in the initial space.

There is no need to implement such algorithms yourself nowadays. You can use ready-to-use mathematical packages such as scikit-learn, MATLAB, or TensorFlow.
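As a sketch of the idea (not a replacement for those packages), a basic PCA projection takes only a few lines of numpy; the shapes below match our 450×128 hidden-layer matrix:

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project rows of `data` onto the top principal components."""
    centered = data - data.mean(axis=0)
    # The principal directions are the right singular vectors of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# E.g. 450 politicians, each a 128-dim vector from the autoencoder's hidden layer.
points = pca_project(np.random.rand(450, 128), n_components=2)
```

t-SNE is much harder to hand-roll; for it, scikit-learn’s `TSNE` or the TensorBoard projector are the practical choices.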

In my previous article, I mentioned that the TensorFlow toolkit contains a package for visualizing data and the training process, called TensorBoard. Let’s use this solution.

""" Projector realisation for data visualisation. Author: Volodymyr Pavliukevych. """ import os import numpy as np import tensorflow as tf from tensorflow.contrib.tensorboard.plugins import projector # Create datasets first_D = 23998 # Number of items (size). second_D = 11999 # Number of items (size). DATA_DIR = '' LOG_DIR = DATA_DIR + 'embedding/' # Load data from autoencoder. first_rada_input = np.loadtxt(DATA_DIR + 'result_' + str(first_D) + '/rada_full_packed.tsv', delimiter='\t') second_rada_input = np.loadtxt(DATA_DIR + 'result_' + str(second_D) + '/rada_full_packed.tsv', delimiter='\t') # Create variables. first_embedding_var = tf.Variable(first_rada_input, name='politicians_embedding_' + str(first_D)) second_embedding_var = tf.Variable(second_rada_input, name='politicians_embedding_' + str(second_D)) saver = tf.train.Saver() with tf.Session() as session: session.run(tf.global_variables_initializer()) saver.save(session, os.path.join(LOG_DIR, "model.ckpt"), 0) config = projector.ProjectorConfig() # You can add multiple embeddings. first_embedding = config.embeddings.add() second_embedding = config.embeddings.add() first_embedding.tensor_name = first_embedding_var.name second_embedding.tensor_name = second_embedding_var.name # Link this tensor to its metadata file (e.g. labels). first_embedding.metadata_path = os.path.join(DATA_DIR, '../rada_full_packed_labels.tsv') second_embedding.metadata_path = os.path.join(DATA_DIR, '../rada_full_packed_labels.tsv') # Attach prepared bookmarks. first_embedding.bookmarks_path = = os.path.join(DATA_DIR, '../result_23998/bookmarks.txt') second_embedding.bookmarks_path = = os.path.join(DATA_DIR, '../result_11999/bookmarks.txt') # Use the same LOG_DIR where you stored your checkpoint. summary_writer = tf.summary.FileWriter(LOG_DIR) # The next line writes a projector_config.pbtxt in the LOG_DIR. TensorBoard will # read this file during startup. projector.visualize_embeddings(summary_writer, config)

There is another way: an entire portal called Projector that allows you to visualize your dataset directly on a Google server:

- Open the TensorBoard Projector website.
- Click **Load Data**.
- Select our dataset with vectors.
- Add the metadata prepared earlier: labels, classes, etc.
- Enable color map by one of the available columns.
- Optionally, add JSON *.config file and publish data for public view.

Now you can send the link to your analyst.

Those interested in the subject domain may find it useful to view various slices, for example:

- Distribution of votes of politicians from different regions.
- Voting accuracy of different parties.
- Distribution of voting of politicians from one party.
- Similarity of voting of politicians from different parties.

- Autoencoders represent a range of simple algorithms that give surprisingly quick and good convergence result.
- Automatic clustering does not answer the question about the nature of the initial data and requires further analysis; however, it provides a quick and clear vector that allows you to start working with your data.
- TensorFlow and TensorBoard are powerful and fast-evolving tools for machine learning that allow solving tasks of diverse complexity.

When I started working in the field of machine learning, it was quite difficult to move to vectors and spaces from objects and their behavior. At first it was rather complicated to wrap my head around all that, and most processes did not seem obvious and clear at once. That’s the reason why I did my best to visualize everything I did in my groundwork: I used to create 3D models, graphs, diagrams, figures, etc.

When speaking about efficient development of machine learning systems, usually such problems as learning speed control, learning process analysis, gathering various learning metrics, and others are mentioned. The major difficulty is that we (people) use 2D and 3D spaces to describe various processes that take place around us. However, processes within neural networks lay in multidimensional spaces, and that makes them rather difficult to understand. Engineers all around the world understand this problem and try to develop various approaches to the visualization or conversion of multidimensional data into simpler and more understandable forms.

There are separate communities dedicated to solving such problems, for example, Distill, Welch Labs, 3Blue1Brown.

When I started working with TensorFlow, I began using the TensorBoard package. It turned out to be a handy cross-platform solution for visualizing different kinds of data. I spent a couple of days “teaching” the Swift application to create reports in the TensorBoard format and integrate them into my neural network.

Development of TensorBoard started in the middle of 2015 in one of the Google laboratories. At the end of 2015, Google opened the source code and the project became open source.

The current version of TensorBoard is a Python package created using TensorFlow, and it allows visualization of the following kinds of data:

- Scalar data in time stack with the smoothing option
- Images in case you can represent your data in 2D, for example, convolutional network weights (filters)
- Actual computational graph (as an interactive view)
- 2D modifications of tensor values over time
- 3D histogram-modification of data allocation within tensor over time
- Text
- Audio

Besides, there is a projector and a possibility to extend TensorBoard using plugins, but that is a topic for another article.

You need to install TensorBoard on your computer (Ubuntu or Mac) to get started.

Also, you need to install Python 3. I recommend installing TensorBoard as part of the TensorFlow package for Python.

```shell
# Linux:
$ sudo apt-get install python3-pip python3-dev
$ pip3 install tensorflow

# MacOS:
$ brew install python3
$ pip3 install tensorflow
```

Now run TensorBoard after specifying a directory for storing reports:

$ tensorboard --logdir /tmp/example/

Let’s open http://localhost:6006/.

An example is available on GitHub. Remember to star my repository.

Now let’s get a view of some cases using an example. Reports (summaries) in the TensorBoard format are created while the computational graph is being built. In TensorFlowKit, I did my best to mirror the Python approach and interface, so that it would be possible to rely on the shared documentation in the future. As I mentioned earlier, each report is added to a summary. It’s a container holding an array of values, each of which represents an event we need to visualize. Later on, the summary is saved to a file in the file system, from where TensorBoard can read it.

So, we need to create a FileWriter, specifying the graph we are going to visualize, and create a summary that will hold our values.

```swift
let summary = Summary(scope: scope)
let fileWriter = try FileWriter(folder: writerURL, identifier: "iMac", graph: graph)
```

After running the application and refreshing the page we will see the graph we’ve built in the code. It will be interactive, so we can navigate it.

Also, we want to see changes of some scalar value over time, for example, the value of the loss function and the accuracy of our neural network. To do that, let’s add output of the operations to the summary:

```swift
try summary.scalar(output: accuracy, key: "scalar-accuracy")
try summary.scalar(output: cross_entropy, key: "scalar-loss")
```

So, after each computation step of our session, TensorFlow automatically extracts the values of our operations and passes them to the input of the resulting summary, which will be saved by FileWriter (I will show how to do that a bit later).

There are a lot of weights and biases in our neural network. Usually these are high-dimensional matrices, and it is quite difficult to analyze their values by printing them out. It’s better to create a distribution diagram. Also, let’s add information about the weight changes made by our network during the learning process to our Summary.

```swift
try summary.histogram(output: bias.output, key: "bias")
try summary.histogram(output: weights.output, key: "weights")
try summary.histogram(output: gradientsOutputs[0], key: "GradientDescentW")
try summary.histogram(output: gradientsOutputs[1], key: "GradientDescentB")
```

Now we have a visualization of the weights and of how they change during the learning process.

However, that is not all. Let’s take a look at the organization of our neural network. Each handwritten digit received as input is reflected in the corresponding weights: the input digit activates certain neurons, and this way leaves a mark in our network. Let me remind you that we have 784 weights for each of the 10 neurons, i.e. 7840 weights in total, represented as a 784×10 matrix. Let’s flatten the matrix into a vector and then extract the weights that correspond to each class.

```swift
let flattenConst = try scope.addConst(values: [Int64(7840)], dimensions: [1], as: "flattenShapeConst")
let imagesFlattenTensor = try scope.reshape(operationName: "FlattenReshape",
                                            tensor: weights.variable,
                                            shape: flattenConst.defaultOutput,
                                            tshape: Int64.self)
try extractImage(from: imagesFlattenTensor, scope: scope, summary: summary, atIndex: 0)
try extractImage(from: imagesFlattenTensor, scope: scope, summary: summary, atIndex: 1)
…
try extractImage(from: imagesFlattenTensor, scope: scope, summary: summary, atIndex: 8)
try extractImage(from: imagesFlattenTensor, scope: scope, summary: summary, atIndex: 9)
```

To do that, let’s add a couple of operations to our graph: *stridedSlice* and *reshape*.

Now let’s add each vector we get into the Summary as an image.

try summary.images(name: "Image-\(String(index))", output: imagesTensor, maxImages: 255, badColor: Summary.BadColor.default)

In the Images section of TensorBoard, we can see the weights’ “marks” as they evolved during the learning process.

Now let’s process our Summary. To do that, we need to merge all created Summaries into one and evaluate it while training the network.

let _ = try summary.merged(identifier: "simple")

While the network works:

let resultOutput = try session.run(inputs: [x, y], values: [xTensorInput, yTensorInput], outputs: [loss, applyGradW, applyGradB, mergedSummary, accuracy], targetOperations: [])
let summary = resultOutput[3]
try fileWriter?.addSummary(tensor: summary, step: Int64(index))

*Please keep in mind that I did not address the problem of accuracy calculation properly: here it is calculated based on the training data, and it is not correct to measure accuracy on the same data the network is trained on.*
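For reference, the usual definition of accuracy can be sketched in plain Swift: the share of samples where the predicted class (the argmax of the output vector) matches the argmax of the one-hot label. This `accuracy` helper is an illustration of the metric, not the TensorFlowKit graph operation.

```swift
/// Classification accuracy: fraction of samples where the index of the
/// largest predicted probability equals the index of the 1 in the label.
func accuracy(predictions: [[Float]], labels: [[Float]]) -> Float {
    let hits = zip(predictions, labels).filter { prediction, label in
        prediction.firstIndex(of: prediction.max()!) == label.firstIndex(of: label.max()!)
    }
    return Float(hits.count) / Float(predictions.count)
}

let predictions: [[Float]] = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
let labels: [[Float]] = [[0, 1, 0], [0, 0, 1]]
print(accuracy(predictions: predictions, labels: labels)) // 0.5
```

To get an honest number, this should be evaluated on the 10,000 held-out test images, not on the training set.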

In the next article, I will explain how to build a single neural network and run it on Ubuntu, macOS, and iOS from one source.


Please read my previous post about Swift & TensorFlow.

I took the “Hello World!” of neural networks as an example: the task of classifying MNIST images. The MNIST dataset includes thousands of images of handwritten digits, each 28×28 pixels in size. So, we have ten classes, neatly divided into 60,000 images for training and 10,000 images for testing. Our task is to create a neural network that can classify an image, i.e. determine which of the 10 classes it belongs to.

Before you can start working with TensorFlowKit, you need to install TensorFlow. On Mac OS, you can use the *brew* package manager:

$ brew install libtensorflow

A build for Linux is available here.

Let’s create a Swift project and add a dependency:

dependencies: [
    .package(url: "https://github.com/Octadero/TensorFlow.git", from: "0.0.7")
]

Now we should prepare the MNIST dataset.

I have written a Swift package for working with the MNIST dataset that you can find here. This package will download the dataset to a temporary folder, unpack it, and represent it as ready-to-use classes.

For example:

dataset = MNISTDataset(callback: { (error: Error?) in print("Ready") })

Now let’s create the required operation graph.

The spaces and subspaces of the calculation graph are called scopes and can have their own names. We’ll provide two vectors as the network input. The first one contains the image, represented as a 784-dimensional vector (28×28 px), so each component of the *x* vector holds a Float value from 0.0 to 1.0 corresponding to the color of a pixel in the image. The second vector is the one-hot encoded class label (see below), where the component corresponding to the class number is set to 1. In the following example it’s class 2.

[0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ]
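Preparing both input vectors can be sketched in plain Swift. The `normalize` and `oneHot` helpers below are hypothetical illustrations of the two encodings, not part of TensorFlowKit or the MNIST package.

```swift
/// Pixels arrive as bytes 0...255; each component of the x vector
/// is that byte divided by 255, giving a Float in 0.0...1.0.
func normalize(_ pixels: [UInt8]) -> [Float] {
    pixels.map { Float($0) / 255.0 }
}

/// The label is one-hot encoded: a vector of zeros with a single 1
/// at the position of the class, as in the class-2 example above.
func oneHot(_ label: Int, classCount: Int = 10) -> [Float] {
    var vector = Array(repeating: Float(0), count: classCount)
    vector[label] = 1
    return vector
}

print(normalize([0, 51, 255])) // [0.0, 0.2, 1.0]
print(oneHot(2))               // [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```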

As the input parameters will change during the training process, let’s create placeholders to refer to them.

/// Input sub scope
let inputScope = scope.subScope(namespace: "input")
let x = try inputScope.placeholder(operationName: "x-input", dtype: Float.self, shape: Shape.dimensions(value: [-1, 784]))
let yLabels = try inputScope.placeholder(operationName: "y-input", dtype: Float.self, shape: Shape.dimensions(value: [-1, 10]))

That’s how Input looks on the graph:

That is our input layer. Now let’s create weights (connections) between the input and hidden layer.

let weights = try weightVariable(at: scope, name: "weights", shape: Shape.dimensions(value: [784, 10]))
let bias = try biasVariable(at: scope, name: "biases", shape: Shape.dimensions(value: [10]))

We create variable operations in the graph because the weights and biases will be adjusted during the training process. Let’s initialize them with tensors filled with zeros.

Now let’s create a hidden layer that performs the primitive operation *(x * W) + b*. It multiplies the vector *x* (dimension 1×784) by the matrix *W* (dimension 784×10) and adds the bias.
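Numerically, this layer is an ordinary matrix-vector product plus a bias. A plain-Swift sketch (the `dense` helper is a hypothetical illustration, not the graph operation) on toy sizes:

```swift
/// Computes (x * W) + b for a 1×n input, an n×m weight matrix stored
/// row-major, and an m-component bias.
func dense(x: [Float], w: [Float], b: [Float]) -> [Float] {
    let outputs = b.count
    precondition(x.count * outputs == w.count)
    return (0..<outputs).map { j in
        var sum = b[j]
        for i in 0..<x.count { sum += x[i] * w[i * outputs + j] }
        return sum
    }
}

// Toy sizes: 3 inputs, 2 outputs (in the article it is 784 and 10).
let y = dense(x: [1, 0, 2],
              w: [0.5, 1.0,
                  2.0, 3.0,
                  0.1, 0.2],
              b: [0.0, 1.0])
print(y) // ≈ [0.7, 2.4]
```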

In our case the hidden layer is also the output layer (this is a *“Hello World!”*-level task), so we need to analyze the output signal and pick the winner. To do that, we use the softmax operation.
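What softmax does can be sketched in plain Swift: exponentiate each output component (shifting by the maximum for numerical stability) and divide by the sum, so the result is a probability distribution. A hypothetical standalone version, not the graph operation:

```swift
import Foundation

/// Softmax: turns raw output signals into probabilities that sum to 1.
func softmax(_ logits: [Float]) -> [Float] {
    let maxLogit = logits.max() ?? 0
    let exps = logits.map { exp($0 - maxLogit) } // shift by max for stability
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

let probabilities = softmax([1.0, 2.0, 3.0])
print(probabilities) // ≈ [0.090, 0.245, 0.665]
```

The largest input gets the largest probability, so the "winner" class is simply the index of the maximum component.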

To better understand what follows, I suggest thinking of our neural network as a complicated function. We feed the vector *x* (representing the image) into this function, and at the output we get a vector showing the probability that the input belongs to each of the available classes.

Now let’s take the natural logarithm of the obtained probability for each class and multiply it by the corresponding component of the correct-class vector we passed in at the very beginning (yLabels). This gives us the error value used to “judge” the neural network. The figure below demonstrates two samples: in the first, for class 2, the error value is 2.3; in the second, for class 1, the error value is 0.

let log = try scope.log(operationName: "Log", x: softmax)
let mul = try scope.mul(operationName: "Mul", x: yLabels, y: log)
let reductionIndices = try scope.addConst(tensor: Tensor(dimensions: [1], values: [Int(1)]), as: "reduction_indices").defaultOutput
let sum = try scope.sum(operationName: "Sum", input: mul, reductionIndices: reductionIndices, keepDims: false, tidx: Int32.self)
let neg = try scope.neg(operationName: "Neg", x: sum)
let meanReductionIndices = try scope.addConst(tensor: Tensor(dimensions: [1], values: [Int(0)]), as: "mean_reduction_indices").defaultOutput
let cross_entropy = try scope.mean(operationName: "Mean", input: neg, reductionIndices: meanReductionIndices, keepDims: false, tidx: Int32.self)
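The same Log, Mul, Sum, Neg, Mean pipeline can be written out in plain Swift to see what it computes: cross-entropy per sample is -Σ yLabel[i] · ln(prediction[i]), averaged over the batch. The `crossEntropy` helper is a hypothetical illustration, not the graph code above.

```swift
import Foundation

/// Cross-entropy loss: -Σ label[i] * ln(prediction[i]) per sample,
/// averaged over the batch (Log → Mul → Sum → Neg → Mean).
func crossEntropy(predictions: [[Float]], labels: [[Float]]) -> Float {
    let perSample = zip(predictions, labels).map { prediction, label in
        -zip(label, prediction).map { $0 * log($1) }.reduce(0, +)
    }
    return perSample.reduce(0, +) / Float(perSample.count)
}

// A confident correct prediction gives a loss near 0;
// a wrong or uncertain one gives a much larger value.
let loss = crossEntropy(predictions: [[0.98, 0.01, 0.01]],
                        labels: [[1, 0, 0]])
print(loss) // ≈ 0.02
```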

In mathematical terms, we need to minimize the target function. One way to do that is the gradient descent method; if necessary, I will describe it in another article.

So, we need to calculate how to correct each of the weights (components of the *W* matrix) and the bias vector *b* so that the neural network makes a smaller error on similar input data. In mathematical terms, we need to find the partial derivatives of the output node with respect to the values of all intermediate nodes. The resulting symbolic gradients let us “shift” the values of the *W* and *b* variables in proportion to how much they affected the result of the previous calculations.

**TensorFlow Magic**

The thing is that TensorFlow can perform all these complicated calculations (well, not quite all of them yet) automatically by analyzing the graph we created.

let gradientsOutputs = try scope.addGradients(yOutputs: [cross_entropy], xOutputs: [weights.variable, bias.variable])

After this operation call, TensorFlow will create about fifty more operations.

Now it is enough to add operations that update the weights using the gradients we obtained, following the gradient descent method.

let _ = try scope.applyGradientDescent(operationName: "applyGradientDescent_W", `var`: weights.variable, alpha: learningRate, delta: gradientsOutputs[0], useLocking: false)
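Numerically, this update is very simple: every variable moves against its gradient, scaled by the learning rate alpha. A plain-Swift sketch of the idea (a hypothetical helper, not the TensorFlowKit operation):

```swift
/// Gradient descent step: w[i] = w[i] - alpha * gradient[i].
func applyGradientDescent(weights: inout [Float], gradients: [Float], alpha: Float) {
    for i in weights.indices {
        weights[i] -= alpha * gradients[i]
    }
}

var w: [Float] = [1.0, -2.0]
applyGradientDescent(weights: &w, gradients: [0.5, -0.5], alpha: 0.1)
print(w) // ≈ [0.95, -1.95]
```

Repeated over many batches, these small steps gradually reduce the cross-entropy loss.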

That’s it – the graph is ready!

As I said, TensorFlow separates the model and calculations. That’s why the graph we created is only a model for performing calculations. We can use Session to start the calculation process. Let’s prepare data from the dataset, place it to tensors, and run the session.

guard let dataset = dataset else { throw MNISTTestsError.datasetNotReady }
guard let images = dataset.files(for: .image(stride: .train)).first as? MNISTImagesFile else { throw MNISTTestsError.datasetNotReady }
guard let labels = dataset.files(for: .label(stride: .train)).first as? MNISTLabelsFile else { throw MNISTTestsError.datasetNotReady }

let xTensorInput = try Tensor(dimensions: [batch, 784], values: xs)
let yTensorInput = try Tensor(dimensions: [batch, 10], values: ys)

We need to run the session several times so that it recalculates the values repeatedly.

for index in 0..<1000 {
    let resultOutput = try session.run(inputs: [x, y], values: [xTensorInput, yTensorInput], outputs: [loss, applyGradW, applyGradB], targetOperations: [])
    if index % 100 == 0 {
        let lossTensor = resultOutput[0]
        let gradWTensor = resultOutput[1]
        let gradBTensor = resultOutput[2]
        let wValues: [Float] = try gradWTensor.pullCollection()
        let bValues: [Float] = try gradBTensor.pullCollection()
        let lossValues: [Float] = try lossTensor.pullCollection()
        guard let lossValue = lossValues.first else { continue }

        print("\(index) loss: ", lossValue)
        lossValueResult = lossValue
        print("w max: \(wValues.max()!) min: \(wValues.min()!) b max: \(bValues.max()!) min: \(bValues.min()!)")
    }
}

The loss value is printed after every 100 iterations. In the next article, I will tell you how to calculate the accuracy of our network and how to visualize it using TensorFlowKit.

I think it is not necessary to explain the meaning of such terms as machine learning and artificial intelligence in 2017. You can find plenty of op-ed articles and research papers on the topic, so I assume the reader is familiar with it and knows the definitions of the basic terms. When talking about machine learning, data scientists and software engineers usually mean deep neural networks, which became quite popular because of their productivity. There are already many software solutions and packages for solving artificial neural network tasks: Caffe, TensorFlow, Torch, Theano (RIP), cuDNN, etc.

Swift is an innovative, protocol-oriented, open source programming language created at Apple by Chris Lattner (who recently left Apple and, after a stint at SpaceX, settled down at Google).

Apple OS already features different libraries for working with matrices and vector algebra, such as BLAS, BNNS, DSP, that were later on gathered in the single Accelerate library.

In 2015, small-scale solutions based on the Metal graphics technology for implementing math appeared.

In 2016, CoreML was introduced:

CoreML can import a finished, trained model (CaffeV1, Keras, scikit-learn) and allows the developer to use it in an application.

So, first you need to prepare a model on another platform using Python or C++ and third-party frameworks. Then you need to train it using a third-party hardware-based solution.

Only after that can you import it and start working with it from Swift. To me, that all seems too complicated.

TensorFlow, like other software packages that implement artificial neural networks, provides many ready-made abstractions and mechanisms for working with processing elements, the connections between them, error evaluation, and backpropagation. However, the difference between TensorFlow and the other packages is that Jeff Dean (a Google employee, author of DFS, TensorFlow, and many other wonderful solutions) decided to build into TensorFlow the idea of separating the data execution model from the data execution process. This means that you first describe the so-called computation graph and only then start the calculation process. Such an approach keeps the data execution **model** and the data execution **process** separate and flexible, allowing execution to be divided among different units (processors, video cards, computers, and clusters).

To solve all mentioned tasks, starting from preparing a model and up to working with it in an ultimate application, I have written an interface that provides access to and allows working with TensorFlow using a single language.

The solution architecture has three levels: low, medium, and high.

- On the low level, a C module allows communicating with libtensorflow from Swift.
- On the medium level, you can move away from C pointers and work with “comprehensible errors”.
- The high level implements various abstractions for accessing model elements and utilities for exporting, importing, and visualizing graphs.

This way you can create a model (calculation graph) in Swift, train it on a server running Ubuntu with several video cards, and then easily open it in your application running on macOS or tvOS. Development can be done in the familiar Xcode with all its virtues and shortcomings.

Artificial neural networks implement a simplified model of the neural connections in nervous tissue. An input signal in the form of a high-dimensional vector reaches the input layer, which consists of processing elements. Each input processing element then transforms the signal based on the properties of the connections (weights) between processing elements and the properties of the processing elements of the following layers, and passes the signal on to the next layer. During the training process, an output signal is generated and compared with the expected one. Based on the difference between the actual and expected output signals, the error rate is determined. This error is then used to calculate the so-called gradient: a vector in whose direction the connections between processing elements need to be corrected so that in the future the network produces signals closer to the expected ones. This process is called backpropagation. Thus, the processing elements and the connections between them accumulate the information needed to generalize the properties of the data model that the neural network is currently learning. The technical implementation comes down to various math operations on matrices and vectors which, in turn, have already been implemented to some extent by solutions such as BLAS, LAPACK, DSP, etc.

In the next article, I will tell you how to use TensorFlowKit to solve the MNIST task.
