A few years ago, ML algorithms looked strange and difficult to the average software engineer. But the field is growing fast, and nowadays it is easy to improve a production system with some machine learning. You no longer need a twenty-person data science department to extend your service with smart analytics or AI features.
I will show you how to add smart search to your service.
Currently, our service is a place where each user can share articles, documents, videos, calendar events, tasks and so on, which means we have a huge database of user content. Finding a particular document or event has become a problem for users. All items have tags and are covered by full-text search, but what about video and audio files?

The usual creation flow is:
1) A user adds a YouTube link or uploads a file.
2) A user names the new item, e.g. ‘New movie 01’.

And that is all. No one wants to spend time improving their own content.
In the real world, though, we can help our users. As a first step we could ask YouTube for the details of a particular video to improve search later. But unfortunately, the video description is usually written by the same lazy user, and the item may also be a plain video file uploaded from a PC, with no metadata at all.
So let’s have the AI extract tags and searchable text from users’ content!
What do we need? As I said earlier, nowadays this is much easier than it sounds:

1) Download the YouTube video;
2) Extract the audio track;
3) Perform speech-to-text;
4) Extract keywords to use as tags;
5) Find the most relevant sentence for full-text search.

We will do all of this in Python, which reduces development time and simplifies integration with ML frameworks.
As I said above, there are plenty of ML frameworks ready to go. One of them is Mozilla DeepSpeech.
To get started, we need to download the latest pre-trained model from GitHub and install the pip module.
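Setup takes a couple of commands; the release version below is an example, so check the DeepSpeech releases page for the latest one:

```shell
# Install the DeepSpeech Python package (use the deepspeech-gpu package for CUDA)
pip install deepspeech

# Fetch a pre-trained English model and scorer from the GitHub releases page
# (v0.9.3 is an example version; pick the latest release)
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
```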

DeepSpeech is based on TensorFlow, so it can easily run on both CPU and GPU.
The next step is to download and process a video file from YouTube. We can use the youtube-dl library for that.
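Downloading can be driven from Python through youtube-dl's `YoutubeDL` class. A minimal sketch (the output template and helper names are my own):

```python
def build_ydl_opts(out_tmpl="%(id)s.%(ext)s"):
    """Options asking youtube-dl for the best audio-only stream available."""
    return {
        "format": "bestaudio/best",  # fall back to full video if no audio-only stream
        "outtmpl": out_tmpl,         # where to store the download
        "quiet": True,
    }

def download_audio(url):
    """Download the audio stream of a video and return the local filename."""
    import youtube_dl  # pip install youtube-dl
    with youtube_dl.YoutubeDL(build_ydl_opts()) as ydl:
        info = ydl.extract_info(url, download=True)
        return ydl.prepare_filename(info)
```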

After that we can extract the audio track and resample it to the format DeepSpeech expects: 16 kHz, mono, 16-bit PCM.

The wave library will help us load the file into a numpy array before DeepSpeech begins its work.
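Reading and transcribing might then look like this (the model filename is an example matching whichever release you downloaded):

```python
import wave

import numpy as np

def read_pcm16(path):
    """Load a 16-bit mono WAV file into the int16 numpy buffer DeepSpeech expects."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    return np.frombuffer(frames, dtype=np.int16)

def transcribe(wav_path, model_path="deepspeech-0.9.3-models.pbmm"):
    """Run DeepSpeech speech-to-text over a prepared WAV file."""
    from deepspeech import Model  # pip install deepspeech
    model = Model(model_path)
    return model.stt(read_pcm16(wav_path))
```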

Step two: searching for meaning.
We need to find a few keywords and relevant sentences in the text. To make this possible, we will use graph-based ranking algorithms. The basic idea implemented by a graph-based ranking model is that of “voting” or “recommendation”. When one vertex links to another one, it is basically casting a vote for that other vertex. The higher the number of votes that are cast for a vertex, the higher the importance of the vertex. Moreover, the importance of the vertex casting the vote determines how important the vote itself is, and this information is also taken into account by the ranking model. Hence, the score associated with a vertex is determined based on the votes that are cast for it, and the score of the vertices casting these votes. The score of a vertex is defined as follows:

$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|}$$

Here d is a damping factor that can be set between 0 and 1, whose role is to integrate into the model the probability of jumping from a given vertex to another random vertex in the graph.
However, in our model the graphs are built from natural-language texts, and may include multiple or partial links between the units (vertices) extracted from the text. It may therefore be useful to indicate and incorporate into the model the “strength” of the connection between two vertices Vi and Vj as a weight wij added to the corresponding edge that connects the two vertices.
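With such weights, the score defined above generalizes to the weighted variant from the TextRank paper:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$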
The task of a keyword extraction application is to automatically identify in a text a set of terms that best describe the document. Such keywords may constitute useful entries for building an automatic index for a document collection, can be used to classify a text, or may serve as a concise summary for a given document. Moreover, a system for automatic identification of important terms in a text can be used for the problem of terminology extraction, and construction of domain-specific dictionaries.
The simplest possible approach is perhaps to use a frequency criterion to select the “important” keywords in a document. However, this method was generally found to lead to poor results, and consequently other methods were explored. The state of the art in this area is currently represented by supervised learning methods, where a system is trained to recognize keywords in a text, based on lexical and syntactic features. The expected end result for this application is a set of words or phrases that are representative for a given natural language text. The units to be ranked are therefore sequences of one or more lexical units extracted from text, and these represent the vertices that are added to the text graph. Any relation that can be defined between two lexical units is a potentially useful connection (edge) that can be added between two such vertices. We are using a co-occurrence relation, controlled by the distance between word occurrences: two vertices are connected if their corresponding lexical units co-occur within a window of maximum N words, where N can be set anywhere from 2 to 10 words.
The vertices added to the graph can be restricted with syntactic filters, which select only lexical units of a certain part of speech. One can for instance consider only nouns and verbs for addition to the graph, and consequently draw potential edges based only on relations that can be established between nouns and verbs. We experimented with various syntactic filters, including: all open-class words, nouns and verbs only, etc., with the best results observed for nouns and adjectives only. The TextRank keyword extraction algorithm is fully unsupervised, and proceeds as follows. First, the text is tokenized and annotated with part-of-speech tags – a preprocessing step required to enable the application of syntactic filters. To avoid excessive growth of the graph size by adding all possible combinations of sequences consisting of more than one lexical unit (ngrams), we consider only single words as candidates for addition to the graph, with multi-word keywords being eventually reconstructed in the post-processing phase. Next, all lexical units that pass the syntactic filter are added to the graph, and an edge is added between those lexical units that co-occur within a window of N words. After the graph is constructed (an undirected, unweighted graph), the score associated with each vertex is set to an initial value of 1, and the ranking algorithm described above is run on the graph for several iterations until it converges – usually for 20-30 iterations, at a threshold of 0.0001.
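The keyword step above can be sketched in pure Python. This is a deliberately simplified version: no part-of-speech filter, an unweighted co-occurrence graph, and a naive tokenizer.

```python
import re
from collections import defaultdict

def textrank_keywords(text, window=2, d=0.85, iters=30, tol=1e-4, top=5):
    """Minimal TextRank keyword sketch: build a co-occurrence graph,
    then run the vertex-scoring iteration until convergence."""
    words = re.findall(r"[a-z]+", text.lower())
    # Undirected co-occurrence edges between words inside the window
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for u in words[i + 1:i + window]:
            if u != w:
                neighbors[w].add(u)
                neighbors[u].add(w)
    if not neighbors:
        return []
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        prev = dict(score)
        for w in score:
            # S(Vi) = (1 - d) + d * sum over neighbors of S(Vj) / degree(Vj)
            score[w] = (1 - d) + d * sum(prev[u] / len(neighbors[u])
                                         for u in neighbors[w])
        if max(abs(score[w] - prev[w]) for w in score) < tol:
            break
    return sorted(score, key=score.get, reverse=True)[:top]
```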
In a way, the problem of sentence extraction can be regarded as similar to keyword extraction, since both applications aim at identifying sequences that are more “representative” for the given text. In keyword extraction, the candidate text units consist of words or phrases, whereas in sentence extraction, we deal with entire sentences. The algorithm turns out to be well suited for this type of application, since it allows for a ranking over text units that is recursively computed based on information drawn from the entire text.
To apply this in our service, we first need to build a graph associated with the text, where the graph vertices are representative of the units to be ranked. For the task of sentence extraction, the goal is to rank entire sentences; therefore, a vertex is added to the graph for each sentence in the text.
The co-occurrence relation used for keyword extraction cannot be applied here, since the text units in consideration are significantly larger than one or few words, and “co-occurrence” is not a meaningful relation for such large contexts. Instead, we are defining a different relation, which determines a connection between two sentences if there is a “similarity” relation between them, where “similarity” is measured as a function of their content overlap. Such a relation between two sentences can be seen as a process of “recommendation”: a sentence that addresses certain concepts in a text, gives the reader a “recommendation” to refer to other sentences in the text that address the same concepts, and therefore a link can be drawn between any two such sentences that share common content.
The overlap of two sentences can be determined simply as the number of common tokens between the lexical representations of the two sentences, or it can be run through syntactic filters, which count only words of a certain syntactic category, e.g. all open-class words, nouns and verbs, etc. Moreover, to avoid promoting long sentences, we use a normalization factor and divide the content overlap of two sentences by the length of each sentence.
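The normalized overlap can be written as a small function. Following the TextRank paper, this sketch normalizes by the log of the sentence lengths; a plain token-set overlap stands in for the syntactic filters.

```python
import math
import re

def sentence_similarity(s1, s2):
    """Content overlap of two sentences, normalized by their (log) lengths
    so that long sentences are not unfairly promoted."""
    w1 = set(re.findall(r"[a-z]+", s1.lower()))
    w2 = set(re.findall(r"[a-z]+", s2.lower()))
    if len(w1) < 2 or len(w2) < 2:
        return 0.0  # avoid log(1) = 0 in the denominator
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))
```

Running the same ranking iteration over a graph whose edges carry these similarities as weights yields the sentence scores.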
In the end we will have a list of keywords to be used as tags, and one top-ranked sentence to feed the full-text search index.

All code is available on GitHub. Good luck!


Author: Volodymyr Pavliukevych

Senior Software Engineer, Data Scientist.

