BERT: a breakthrough in NLP technology or just more Deep Learning hype?

07.10.2019

Since 2013, according to Google Trends, the popularity of the term “Deep Learning” (DL) has been growing rapidly as this machine learning method gains ground in Data Science. In this article, I will briefly discuss the advantages of deep neural networks, using the BERT model as an example, describe its practical applications in natural language processing (NLP), and compare it with some similar technologies.

What deep learning is and how BERT relates to it

Let's start with the definition: deep learning is a family of machine learning methods based on learning representations (feature / representation learning) rather than on specialized algorithms for specific tasks. In DL, multilayer neural networks implement a multilevel system of nonlinear filters for feature extraction. DL models are characterized by a hierarchical combination of several learning approaches: supervised, unsupervised, and reinforcement learning [1]. In particular, BERT is a neural network that is first pre-trained on a huge amount of data with simple tasks and then applied to specific natural language processing problems.

Note that the ideas of deep learning were already known when the concept of “artificial intelligence” (AI) in its modern sense took shape, in the 1970s and 1980s, when multilayer perceptrons and other neural network models began to be used in real pattern recognition systems. However, the practical effectiveness of these methods was not high enough at the time, because the available hardware could not support the complex architecture of such networks. As a result, active development of DL methods stalled until the mid-2000s, when graphics processors became powerful enough to make training neural networks relatively quick and inexpensive, and a significant number of datasets for training networks had accumulated worldwide. In addition, by this time the work of AI researchers (Hinton, Osindero, Teh, Bengio) had shown the effectiveness of layer-wise training of neural networks, in which each layer is first trained separately using a restricted Boltzmann machine and the whole network is then fine-tuned with backpropagation [1]. I will cover the history and operating principles of neural networks in more detail in a separate article; for now, let's turn to BERT.

BERT is a bidirectional multilingual model with a transformer architecture (Fig. 1), designed to solve specific NLP problems, for example, sentiment analysis (determining the emotional tone of a text), question answering, text classification, natural language inference (drawing conclusions from a text), etc. Besides speech recognition, the classic NLP task is text analysis, including data extraction, information retrieval, and sentence analysis. Natural language processing also covers text generation, speech synthesis, machine translation, and automatic summarization, annotation, and simplification of text. Thus, the purpose of NLP technologies is not only to have artificial intelligence recognize living language, but also to let it interact with that language adequately. The latter, in effect, means that the AI tool understands spoken or written speech [2].

The BERT model is pre-trained without labeled data on 2 NLP tasks: masked language modeling and next sentence prediction. BERT builds on the latest advances in neural networks and NLP technologies published in 2018 (ELMo, ULMFiT, OpenAI Transformer and GPT-2), which make it possible to pre-train language models on large unlabeled corpora and then fine-tune them for specific problems [3]. Thanks to this, BERT can be used to build effective AI programs for natural language processing: answering free-form questions, creating chatbots, performing automatic translation, analyzing text, etc. [4]. Google Research released its open source TensorFlow implementation of BERT in late 2018, including several pre-trained models with different numbers of layers, hidden units, attention heads, and parameters. In particular, the multilingual BERT-Base model, which supports 104 languages, consists of 12 layers, 768 hidden units, 12 attention heads, and 110M parameters [3].
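As a rough sanity check on these figures, the sketch below counts the weights of a BERT-Base-sized encoder from the 12 layers and the width of 768 quoted above. The vocabulary size of 30,522 is the one this article quotes for BERT-Base further on; the 512 positions, 2 segment types, and feed-forward width of 3,072 are assumptions taken from the standard BERT-Base configuration, not from the text. The multilingual model has a larger vocabulary, so its embedding table, and hence its total, would be bigger than what is computed here.

```python
def bert_param_count(vocab=30522, hidden=768, layers=12,
                     ffn=3072, max_pos=512, segments=2):
    """Count weights and biases of a BERT-style encoder (assumed layout)."""
    layer_norm = 2 * hidden                       # scale + shift vectors
    # Token, position and segment embedding tables plus one LayerNorm.
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    # Query, key, value and output projections. The number of attention
    # heads only splits these matrices, so it does not change the count.
    attention = 4 * (hidden * hidden + hidden)
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden             # dense layer on the first token
    return embeddings + layers * per_layer + pooler

total = bert_param_count()
print(f"{total:,} parameters")
```

Under these assumptions the total lands close to the 110M quoted for BERT-Base, with roughly a fifth of the weights sitting in the embedding tables.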

The practice of using BERT and other neural networks in NLP tasks: some examples

One of the classic NLP cases is text classification, which assigns each document to one or more classes (labels). In practice, a text can often belong to several classes simultaneously and independently (Fig. 2), for example, when classifying goods at an enterprise, determining the genre of a film or literary work, or sorting emails by the topics of their contents [3].

Fig. 2. Binary and multiclass classifications
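The difference between choosing exactly one class and assigning several classes independently can be sketched in a few lines. The genre scores and the 0.5 threshold below are hypothetical values chosen for illustration; in a real system they would come from a trained classifier.

```python
def predict_multiclass(scores):
    """Single-label decision: pick the one class with the highest score."""
    return max(scores, key=scores.get)

def predict_multilabel(scores, threshold=0.5):
    """Multi-label decision: each class is accepted or rejected independently."""
    return sorted(label for label, s in scores.items() if s >= threshold)

# Hypothetical per-genre scores a classifier might output for one film.
scores = {"comedy": 0.81, "drama": 0.64, "horror": 0.07}

print(predict_multiclass(scores))   # comedy
print(predict_multilabel(scores))   # ['comedy', 'drama']
```

In the multi-label case the scores do not have to sum to one, because every class is judged on its own.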

For example, analysis of user reviews of films or products in an online store can feed the recommender systems of these businesses in order to encourage customers to make new purchases. Determining the sentiment of a text and analyzing its content is also relevant to corporate reputation management, namely SERM (Search Engine Reputation Management), which aims to create a positive image of a company by influencing search results using PR, SMM, and SEO techniques.

Today, automated collection and initial analysis of company mentions is carried out by specialized SERM systems that vary in functionality and cost, from free online services to commercial solutions. These tools analyze results for keywords containing the brand name in search engines, price aggregators, thematic portals, review and recommendation sites, as well as social networks and videos [5].

As a vivid example of what ML can do in this NLP context, consider the case of Sberbank, which analyzed user reviews of its mobile application in the Google Play store in order to identify incidents and prevent them. The analysis covered 882,864 user reviews left between October 2014 and October 2017. Only negative reviews (1-2 stars) were used to determine the topics of incidents, but the whole sample was used to train the ML model. To forecast the acceptable level of negative reviews for a selected date, the 3-month interval preceding it was used. The forecast was built one week ahead of the selected date, at a resolution of one day [6].

An anomaly was recorded when the actual number of negative reviews exceeded the confidence threshold, defined as the sum of the predicted value and the confidence interval. Figure 3 shows the actual number of reviews in red, and the predicted normal level with its confidence interval in yellow [6].

Fig. 3. Plot of the review analysis for the Sberbank mobile application
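The anomaly rule described above (actual count greater than predicted value plus confidence interval) can be sketched as follows. The daily counts are invented for illustration, and how the forecast and the interval were actually computed in the Sberbank case is not specified here, so both are simply passed in as lists.

```python
def find_anomalies(actual, predicted, interval):
    """Return indices of days whose negative-review count exceeds the threshold.

    The threshold for each day is predicted value + confidence interval.
    """
    anomalies = []
    for day, (a, p, ci) in enumerate(zip(actual, predicted, interval)):
        if a > p + ci:
            anomalies.append(day)
    return anomalies

# Hypothetical daily counts of 1-2 star reviews for one forecast week.
actual    = [40, 42, 95, 41, 39, 120, 44]
predicted = [41, 40, 43, 42, 40, 41, 42]
interval  = [10, 10, 10, 10, 10, 10, 10]

print(find_anomalies(actual, predicted, interval))  # [2, 5]
```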

Further analysis focused on the dates of the 5 distinct peaks. The review texts from these dates were clustered by keywords describing the essence of the problem, for example, “connection”, “SMS”, “update”, etc. Based on the clustering results, problems were identified on the following topics [6]:

application operation related to a version upgrade;
login to the application after an update;
login to the application and the privacy policy;
application operation related to the connection with the bank;
sending an SMS with a code to the user;
money transfer;
application interface;
application operation related to the built-in antivirus.
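To make the idea of grouping reviews by problem keywords concrete, here is a deliberately simplified sketch. It uses plain keyword matching as a stand-in for the clustering in the original study, and the keyword lists and sample reviews are hypothetical: the article names only “connection”, “SMS” and “update”.

```python
from collections import defaultdict

# Hypothetical keyword lists per topic (assumption, not from the study).
TOPIC_KEYWORDS = {
    "update": ["update", "upgrade", "version"],
    "connection": ["connection", "connect"],
    "sms": ["sms", "code"],
}

def group_by_keywords(reviews):
    """Assign each review to every topic whose keywords it mentions."""
    groups = defaultdict(list)
    for text in reviews:
        lowered = text.lower()
        matched = [topic for topic, keywords in TOPIC_KEYWORDS.items()
                   if any(kw in lowered for kw in keywords)]
        # Reviews that match no keyword list go to a catch-all group.
        for topic in matched or ["other"]:
            groups[topic].append(text)
    return dict(groups)

reviews = [
    "Cannot log in after the update",
    "No connection with the bank",
    "SMS with the code never arrives",
    "Crashes on start",
]
for topic, texts in group_by_keywords(reviews).items():
    print(topic, len(texts))
```

A review can land in several groups at once, which matches the observation that one complaint may touch more than one problem area.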

Note that the trained machine learning model was not only able to perform post hoc analysis but also worked ahead of the curve, i.e. it predicted an increase in problems of a certain category on specific dates. In the future, this technique can be used not only to prevent incidents related to the operation of the mobile application, but also for other SERM tasks [6].

Topic modeling vs vector NLP technology

In the Sberbank case above, the open source BigARTM library was used for topic modeling (also called thematic modeling) of large collections of text documents and arrays of transactional data. This statistical text analysis technique automatically identifies topics in large document collections: it determines which topics each document relates to and which words describe each topic. No manual labeling of texts is required, and the ML model is trained without supervision. Topic modeling supports multi-label assignment, i.e. a document can belong to several topic clusters at once, and it lets you answer questions such as "what is this text about" or "what topics does this pair of texts have in common". The topic model produces a compact vector representation of a text, which helps to classify, categorize, annotate, and segment texts. Unlike the well-known vector representations of the x2vec family (word2vec, paragraph2vec, graph2vec, etc.), each coordinate of a topic vector corresponds to a topic and has a meaningful interpretation. The topic model attaches to each topic a list of keywords or phrases that describes its semantics [7].
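The two outputs described above, a list of words per topic and a vector of topic weights per document, can be illustrated with a tiny topic model. The sketch below does not use the BigARTM API; it fits plain PLSA (probabilistic latent semantic analysis) with the EM algorithm, a simpler model of the same family, on three invented documents. The function name, the toy corpus, and the number of topics are all assumptions made for the example.

```python
import random

def plsa(docs, n_topics=2, n_iter=50, seed=0):
    """Fit a tiny PLSA topic model with EM.

    Returns (vocab, phi, theta), where phi[t][w] = p(word | topic)
    and theta[d][t] = p(topic | document).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in docs]        # document-term counts
    for d, doc in enumerate(docs):
        for w in doc:
            counts[d][index[w]] += 1

    def normalize(row):
        total = sum(row)
        return [x / total for x in row]

    # Random starting point for both distributions.
    phi = [normalize([rng.random() + 0.1 for _ in vocab]) for _ in range(n_topics)]
    theta = [normalize([rng.random() + 0.1 for _ in range(n_topics)]) for _ in docs]

    for _ in range(n_iter):
        # Small constant keeps every probability strictly positive.
        new_phi = [[1e-12] * len(vocab) for _ in range(n_topics)]
        new_theta = [[1e-12] * n_topics for _ in docs]
        for d, row in enumerate(counts):
            for w, n_dw in enumerate(row):
                if n_dw == 0:
                    continue
                # E-step: p(topic | document, word)
                post = normalize([phi[t][w] * theta[d][t] for t in range(n_topics)])
                # M-step accumulators: expected counts per topic
                for t in range(n_topics):
                    new_phi[t][w] += n_dw * post[t]
                    new_theta[d][t] += n_dw * post[t]
        phi = [normalize(r) for r in new_phi]
        theta = [normalize(r) for r in new_theta]
    return vocab, phi, theta

docs = [
    "sms code login sms code".split(),
    "update version crash update".split(),
    "sms code update version".split(),
]
vocab, phi, theta = plsa(docs)
for t, row in enumerate(phi):
    top = sorted(zip(row, vocab), reverse=True)[:2]
    print(f"topic {t}:", [w for _, w in top])
for d, row in enumerate(theta):
    print(f"doc {d}:", [round(x, 2) for x in row])
```

Each row of `theta` is the interpretable topic vector of one document: its coordinates are topic shares that sum to one, and each topic can be read through its top words in `phi`.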

Unlike the topic model, BERT relies on vector representations of words based on contextual proximity: words that occur in text next to the same words (and therefore have similar meanings) receive nearby vectors. The resulting vectors can be used for natural language processing and machine learning, in particular for predicting words [8]. Thus, in contrast to the topic model, BERT also performs predictive functions. This property of vector NLP technologies is useful in some specific text analysis problems, for example, authorship attribution. Each person has characteristic phrases, cliches, and other lexical constructions; these can be grouped into stable vectors, and the frequency with which they recur in a given text can be used to identify its author.
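The contextual-proximity idea can be shown with a far simpler technique than BERT's learned embeddings: counting which words occur near each word and comparing the resulting count vectors by cosine similarity. The toy corpus and the window size of 2 are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sent[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

corpus = [
    "i drink hot tea every morning".split(),
    "i drink hot coffee every morning".split(),
    "the bank approved my loan".split(),
]
vecs = cooccurrence_vectors(corpus)
print(round(cosine(vecs["tea"], vecs["coffee"]), 2))  # 1.0
print(round(cosine(vecs["tea"], vecs["loan"]), 2))    # 0.0
```

Here “tea” and “coffee” share exactly the same neighbours, so their vectors coincide, while “tea” and “loan” have no neighbours in common.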

How the BERT neural network works: architecture and principle of operation

To train BERT to predict words, phrases are fed to the input of the neural network with some of the words replaced by the [MASK] token. For example, given the sentence “I came to [MASK] and bought [MASK]” as input, BERT should output the words “shop” and “milk”. This is a simplified example from the official BERT page; on longer sentences the range of plausible options narrows, and the network's response becomes more definite. For the neural network to learn the relationships between sentences, it is additionally trained to predict whether the second phrase is a logical continuation of the first [4].
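How training inputs for the masked-word task might be prepared can be sketched as follows. The scheme of selecting about 15% of positions and then replacing 80% of them with [MASK], 10% with a random word, and leaving 10% unchanged comes from the original BERT paper rather than from this article; the function name, the toy sentence, and the higher masking rate used in the demo call are assumptions.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token at positions selected for
    prediction and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # usual case: hide the word
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # sometimes: swap in a random word
        # otherwise: keep the original word unchanged
    return masked, labels

tokens = "i came to the shop and bought milk".split()
# A higher rate than 15% is used so the short demo sentence gets masked at all.
masked, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.3)
print(masked)
print(labels)
```

During training the model sees `masked` and is scored only on the positions where `labels` is not None.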

When text is submitted to the BERT model, it is first tokenized, i.e. divided into smaller units (tokens): paragraphs are divided into sentences, sentences into words, etc. (Fig. 4). The input text is split into a list of tokens that are present in the vocabulary. For example, the BERT-Base model mentioned above uses a vocabulary of 30,522 tokens. If a word is not in the vocabulary, it is progressively broken into smaller parts that are. Thus, the representation of a new word is composed from the meanings of its parts [3].
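The step of breaking an unknown word into known parts can be sketched with a greedy longest-match-first split, the strategy used by WordPiece-style tokenizers. The "##" prefix for word-internal pieces follows the convention of BERT's vocabulary files; the toy vocabulary and the function name are assumptions, and this is not the reference tokenizer itself.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter prefix
        if piece is None:
            return [unk]                       # no piece fits: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary; the real BERT-Base one has 30,522 entries.
vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}
print(wordpiece("playing", vocab))     # ['play', '##ing']
print(wordpiece("unplayable", vocab))  # ['un', '##play', '##able']
print(wordpiece("xyz", vocab))         # ['[UNK]']
```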

Thus, BERT is an autoencoder (AE) that hides or modifies some words and tries to restore the original word sequence from the context. This leads to the following disadvantages of the model [9]:

each hidden word is predicted independently, so information about possible connections between masked words is lost; for example, “New York” is a fixed combination of words, and when it is split into independent parts the original meaning is completely lost;
a mismatch between training and use of the pre-trained BERT model: [MASK] tokens are used during training, but when the pre-trained model is applied, such tokens are no longer fed to its input.

However, despite these problems, BERT is regarded as the current state of the art in the NLP domain.

Test Results and Alternatives to BERT

Evaluations conducted in 2019 confirmed the effectiveness of BERT-based DL methods, which achieved the highest scores on the classic natural language understanding benchmarks. In particular, the ALBERT model created at Google AI scored 92.2 points on the Stanford SQuAD question answering test and 89.4 points on the GLUE language understanding benchmark. In July 2019, Facebook introduced its own BERT-based model, RoBERTa, and a month earlier Microsoft AI presented its similar MT-DNN neural network, which achieved the best results in 7 out of 9 GLUE tests [10].

However, BERT is not the only DL network that shows excellent results on NLP problems, although it may be the most popular one. XLNet, a well-known model with a multi-layer transformer architecture, outperforms BERT on the RACE (Reading Comprehension From Examinations) test. Figure 5 shows reading comprehension accuracy on the two parts of this dataset, drawn from middle school (Middle) and high school (High) exams. The BERT and XLNet networks compared had 24 layers each and were similar in size. On other text classification problems, XLNet also showed better results (Fig. 6) [11].

Fig. 5. Results of neural network models on the RACE problem

Fig. 6. The results of neural network models for other problems of text classification

These excellent XLNet results are due to the following factors [11]:

XLNet does not mask words in a sequence, so it avoids the mismatch between pre-training and fine-tuning for a specific task that is typical of BERT;
unlike typical autoregressive models, XLNet does not use a fixed forward or backward factorization order. Instead, XLNet maximizes the expected log-likelihood of a word sequence over all permutations of the factorization order. Thanks to the permutation step, the context for each position in the sequence can consist of words on both the right and the left. Thus, the word at each position uses contextual information from all other positions (bidirectional context).
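The effect of permuting the factorization order can be seen in a small sketch. It samples one random order over the positions of a sentence and lists, for each word being predicted, the words it may condition on; this only illustrates the idea and is not XLNet's actual training code. The function name and the sample sentence are assumptions.

```python
import random

def permutation_contexts(tokens, seed=0):
    """Sample one factorization order and list, for each predicted token,
    the tokens it is allowed to condition on."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)                       # a random permutation of positions
    steps = []
    for t, pos in enumerate(order):
        visible = sorted(order[:t])          # positions earlier in the permutation
        steps.append((tokens[pos], [tokens[p] for p in visible]))
    return order, steps

tokens = ["New", "York", "is", "a", "city"]
order, steps = permutation_contexts(tokens)
print("factorization order:", order)
for target, context in steps:
    print(f"predict {target!r} given {context}")
```

The words keep their original positions; only the order of prediction changes. Averaged over many sampled orders, every word is eventually predicted from contexts that include words on both its left and its right, and no [MASK] token is needed.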

As a result, XLNet combines the properties of autoregressive language models and autoencoders while avoiding the shortcomings of both.

Summary

Summing up the implementation of deep learning ideas in the form of BERT-like models for natural language processing tasks, several conclusions can be drawn:

NLP technologies are actively used in modern marketing, PR and corporate reputation management;
unsupervised pre-training of ML models on large corpora, followed by fine-tuning for specific tasks, can significantly speed up training and yield highly accurate final results;
BERT and similar neural networks are effective for typical NLP tasks and can serve as a “semi-finished product” for building a chatbot or recommender system; however, despite their state-of-the-art status, they are not yet able to completely replace a human;
XLNet continues the ideas of BERT but is free of this architecture's drawbacks: it avoids the mismatch between pre-training and subsequent fine-tuning for a specific task, as well as in use under "real conditions" (production).

Sources

Alexei Chernobrovov, Consultant on Analytics and Data Monetization