The place of neural networks in Data Science: a brief primer and recent trends

18.11.2019

In this article, we consider one of today's most popular concepts, neural networks, which have effectively become the “face” of modern artificial intelligence. We look at how this machine learning method works, why it became so popular in the 21st century, what deep neural networks are, and what to expect from them in the future.

A brief history of neural networks

First of all, note that neural networks are not the only method of machine learning and artificial intelligence. Within supervised learning, besides neural networks, there are the error-correction and backpropagation methods, as well as the support vector machine (SVM), whose application to one-class classification I described in a separate article. Other ML approaches include unsupervised learning (alpha and gamma reinforcement systems, the nearest-neighbor method), reinforcement learning (genetic algorithms), semi-supervised, active, transductive, multitask and multiple-instance learning, as well as boosting and Bayesian algorithms [1].

However, today it is neural networks that are considered the most common ML-tools. In particular, they are widely used in image recognition, classification, clustering, forecasting, optimization, management decision-making, data analysis and compression in various application areas: from medicine to economics.

The history of neural networks in IT begins in the 1940s, when the American scientists McCulloch, Pitts and Wiener described the concept in their works on the logical calculus of ideas in nervous activity and on cybernetics, aiming to represent complex biological processes as mathematical models.

It is worth mentioning the theoretical basis of neural networks, the Kolmogorov-Arnold theorem on the representability of continuous functions of several variables by a superposition of continuous functions of one variable. The theorem was proved by the Soviet mathematicians A. N. Kolmogorov and V. I. Arnold in 1957, and in 1987 the American researcher Hecht-Nielsen adapted it to neural networks. It shows that any function of many variables of a sufficiently general form can be represented by a two-layer feedforward neural network in which the input-layer neurons are fully connected to hidden-layer neurons with known bounded activation functions (e.g., sigmoidal), and the output-layer neurons have unknown activation functions.
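For reference, the representation guaranteed by the theorem is usually written as below. This is the standard textbook form rather than a formula from the cited sources; f is assumed continuous on the unit cube, and the symbols Φ_q and φ_{q,p} (continuous functions of one variable) are notation introduced here.

```latex
f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right)
```

Read as a network, the inner sums play the role of the hidden layer and the outer functions Φ_q play the role of the output layer.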

From this theorem it follows that for any function of many variables there exists a neural network of fixed dimension that represents it, and that three kinds of parameters can be adjusted when training it [2]:

the range of values of the sigmoidal activation functions of the hidden-layer neurons;
the slope of the sigmoids of the neurons in this layer;
the form of the activation functions of the output-layer neurons.

Another key milestone in the history of neural networks was Frank Rosenblatt's invention of the perceptron at the turn of the 1950s and 1960s. Thanks to the successful use of perceptrons in a limited range of tasks (weather forecasting, classification, handwriting recognition), neural networks became very popular among scientists around the world. In the USSR, for example, scientists of the school of M. M. Bongard and A. P. Petrov worked on neural networks at the Institute for Information Transmission Problems (1963-1965). However, since the computing power of the time could not effectively put the theoretical algorithms into practice, research enthusiasm for these ML methods temporarily waned.

The next wave of interest in neural networks began 20 years later, in the 1980s, and in effect continues to this day. Worth noting here are the many variations of Kohonen and Hopfield networks, which evolved into deep learning models, a trend we discuss in more detail below [3].

What are neural networks and how do they work

Let's start with the classical definition: an artificial neural network is a mathematical model, implemented in software or hardware, built on the principle of the organization and functioning of biological neural networks, that is, the networks of nerve cells of a living organism. A neural network is a system of connected and interacting simple processors (artificial neurons), each of which deals only with the signals it periodically receives and the signals it periodically sends to other neurons [3].

What distinguishes neural networks from other computational models is their grounding in biological principles, which gives them the following qualities:

massive parallelism;
distributed representation of information and computation;
ability to learn and generalize;
adaptability;
contextual information processing;
fault tolerance;
low power consumption.

The above properties allow the use of neural networks for solving complex problems with incomplete input data or the absence of clearly defined algorithmic rules, for example, for forecasting (weather, exchange rates, emergencies, etc.), pattern recognition (images, video and audio streams), classification, management decision making, optimization, data analysis, etc. It is noteworthy that neural networks can be used in almost any field of industry and art, from the oil and gas sector to music.

The rules by which a neural network operates are not programmed explicitly but are developed during training, which makes this ML model adaptable to changes in the input signals and to noise. Technically, training a neural network consists in finding the weights of the connections between neurons; in the process the network becomes able to identify complex relationships between input and output and to generalize [3].
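To make "finding the coefficients" concrete, here is a minimal sketch of training by gradient descent on the simplest possible model, a single linear unit y ≈ w·x + b. The function name `train_linear`, the learning rate, the number of epochs and the data are assumptions chosen for illustration; real networks apply the same idea to many weights at once.

```python
def train_linear(samples, lr=0.1, epochs=500):
    """Fit y ≈ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        # Move each coefficient a small step against its gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical training data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_linear(data)
print(round(w, 3), round(b, 3))
```

Nothing in the code states the rule y = 2x + 1; the coefficients are recovered from the examples alone, which is the sense in which the rules are "developed in the learning process".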

A typical neural network consists of three components (Fig. 1):

input nodes, whose neurons receive the initial vector of values encoding the input signal and pass it on to the next layer, amplifying or attenuating it;
hidden (intermediate) nodes, which perform the main computations;
output nodes, whose neurons form the outputs of the network and can sometimes also perform computations.

Fig. 1. The structure of the neural network

Each neuron of one layer passes signals to the neurons of the next layer through synaptic connections with weighting factors (Fig. 2): signals travel by forward propagation, while during training the error is propagated in the reverse direction (backpropagation). Figure 2 shows the diagram of an artificial neuron, where

1 - the neurons whose output signals are fed to the inputs of this neuron (x_i);
2 - the adder of the input signals;
3 - the activation function calculator;
4 - the neurons whose inputs receive the output signal of this neuron;
5 - w_i, the weights of the input signals.

Mathematically, a neuron is a weighted adder whose single output y = f(u) is determined by its inputs and weights [4]:

$$y = f(u), \qquad u = \sum_{i=1}^{n} w_i x_i + w_0 x_0,$$

where

x_i are the signals at the inputs of the neuron, i = 1, ..., n; their values lie in the interval [0, 1] and can be discrete or analog;
w_i are the weighting factors by which the corresponding input signals x_i are multiplied;
x₀ is an additional input of the neuron, whose signal is multiplied by the weight w₀; it is used to initialize the neuron, that is, to shift the activation function along the horizontal axis and thereby form a sensitivity threshold.
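The definitions above translate almost directly into code. The following is a minimal sketch of a single neuron that computes the weighted sum u = Σ w_i·x_i + w₀·x₀ and passes it through an activation function. The function names, the default x₀ = 1 and the sample values are assumptions made for illustration.

```python
import math


def sigmoid(u: float) -> float:
    """Logistic activation: maps any real u into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))


def neuron_output(inputs, weights, w0, x0=1.0, activation=sigmoid):
    """Return y = f(u), where u = sum_i w_i * x_i + w0 * x0."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    u = sum(w * x for w, x in zip(weights, inputs)) + w0 * x0
    return activation(u)


# Hypothetical input signals in [0, 1] and weights, chosen only for illustration.
x = [0.2, 0.9, 0.5]
w = [0.4, -0.6, 1.0]
print(neuron_output(x, w, w0=-0.1))
```

Changing `w0` while keeping the other weights fixed shifts the point at which the neuron starts to respond, which is the threshold role of the additional input.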

Thus, the output of each neuron is the result of applying its activation function, which is usually nonlinear: a sigmoid, sinusoid, Gaussian, step function and the like (Fig. 3). The activation function determines how the signal at the output of the neuron depends on the weighted sum of the signals at its inputs. Thanks to this nonlinearity, neural networks with a fairly small number of nodes and layers can solve rather complex forecasting, recognition and classification problems. Activation functions differ from one another in the following characteristics [5]:

range of values, for example from minus to plus infinity, or a bounded range such as [0, 1] or (-π/2, π/2). When the range is bounded, gradient-based training methods are more stable, because pattern presentations significantly affect only a limited number of weights. When the range is infinite, training is generally more efficient, because most weights are affected, but smaller learning rates are usually needed, so the network learns more slowly;
order of smoothness, that is, the continuity of the derivative, which makes it possible to use optimization methods based on gradient descent and provides a higher degree of generality;
monotonicity (of the function itself and of its derivative), that is, whether it only decreases or only increases over the whole domain. For a monotonic activation function, the error surface associated with a single-layer model is guaranteed to be convex, so that a local extremum coincides with the global one, which is important for optimization problems;
approximation of the identity function near the origin: if this property holds, the neural network trains effectively when its weights are initialized with small random values.

As a rule, each type of problem and network topology has its own preferred activation function. For example, multilayer perceptrons use a sigmoid in the form of the hyperbolic tangent, which normalizes well: it amplifies weak signals and does not grow without bound on strong ones. In radial basis function networks, the most commonly used activation functions are Gaussian, multiquadric or inverse multiquadric, whose parameters make it possible to tune the spread (radius) and thus the region of influence of each neuron [5].
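The properties discussed above are easy to check numerically. Below is a small sketch of four common activation functions; the parameter names `center`, `width` and `threshold` are assumptions introduced for illustration.

```python
import math


def sigmoid(u):
    """Logistic sigmoid: smooth, monotonic, bounded range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))


def tanh(u):
    """Hyperbolic tangent: bounded range (-1, 1), close to the identity near 0."""
    return math.tanh(u)


def gaussian(u, center=0.0, width=1.0):
    """Gaussian radial basis function; `width` sets the neuron's region of influence."""
    return math.exp(-((u - center) ** 2) / (2.0 * width ** 2))


def step(u, threshold=0.0):
    """Step function: outputs 1 once the weighted sum reaches the threshold."""
    return 1.0 if u >= threshold else 0.0


# Compare the functions on a few weighted sums, from strongly negative to strongly positive.
for u in (-5.0, -0.1, 0.0, 0.1, 5.0):
    print(u, round(sigmoid(u), 4), round(tanh(u), 4), round(gaussian(u), 4), step(u))
```

Near zero the hyperbolic tangent returns almost exactly its argument, while for large inputs it stays within (-1, 1); a larger `width` makes the Gaussian respond to inputs farther from its center.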

Fig. 3. The most common activation functions

Artificial neurons are combined into networks with different topologies, depending on the problem being solved (Fig. 4). For example, perceptrons and convolutional neural networks (supervised learning), adaptive resonance networks (unsupervised learning) and radial basis function networks (mixed learning) are often used for pattern recognition. For data analysis, Kohonen networks (the self-organizing map and learning vector quantization networks) are used. The nature of the training dataset also affects the choice of network type. In particular, in time series forecasting the expert assessment is already contained in the source data and can be extracted while processing it, so in these cases a multilayer perceptron or a Ward network can be used [3].

Fig. 4. Popular neural network topologies

Current trends in the development of neural network technologies

Neural network technologies have been developing especially actively since the 2000s, when graphics processors became powerful enough for fast and inexpensive training of neural networks and a large number of training datasets had accumulated worldwide. For example, until 2010 there was no database large enough to properly train neural networks for image recognition and classification, so networks often made mistakes, confusing a cat with a dog or an image of a healthy organ with that of a diseased one. With the advent of ImageNet in 2010, which contained 15 million images in 22 thousand categories and was available to any researcher, the quality of the results improved significantly. In addition, by that time new results had appeared in artificial intelligence research: Geoffrey Hinton implemented pre-training of a network using a Boltzmann machine, training each layer separately; Yann LeCun proposed using convolutional neural networks for image recognition; and Yoshua Bengio developed a stacked autoencoder that made it possible to use all the layers of a deep neural network [6]. These studies formed the basis of the current trends in neural network technologies, the most significant of which are the following:

deep learning (DL) - a hierarchical combination of several learning algorithms (supervised, unsupervised, reinforcement), in which the neural network is first trained on a large amount of general data and then fine-tuned on datasets specific to a particular task;
hybrid learning - a combination of DL models with Bayesian approaches, which are well suited to probabilistic modeling and to inferring cause-and-effect relationships in bioinformatics (genetic networks, protein structure), medicine, document classification, image processing and decision support systems [7]. Bayesian algorithms significantly improve the quality of training by helping generative adversarial networks (GAN) produce training data that is as close to real data as possible [8];
automated machine learning (AutoML) - automation of all ML processes, from preliminary data preparation to the analysis of modeling results. AutoML tools (Google AutoML, Auto Keras, RECIPE, TransmogrifAI, Auto-WEKA, H2O AutoML and other frameworks and libraries) significantly simplify the work of a Data Scientist and save time by automatically constructing features, optimizing hyperparameters, searching for the best architecture, selecting pipelines and evaluation metrics, detecting errors and performing other ML procedures [9]. AutoML can also be seen as a way of democratizing AI, since it makes it possible to create ML models without complex programming [10].

Next, we will examine in more detail the methods of deep and automatic ML.

Deep learning: principles of work and results of use

Deep learning covers a class of ML models based on representation learning (feature learning) rather than on specialized algorithms for specific tasks. In DL, multilayer neural networks act as a multilevel system of nonlinear filters for feature extraction. DL models are characterized by a hierarchical combination of several learning algorithms: supervised, unsupervised and reinforcement. The architecture of the network and the composition of its nonlinear layers depend on the problem being solved. Hidden variables organized in layers are used, for example the nodes of a deep belief network or of a deep restricted Boltzmann machine. Regardless of architecture and application, all DL networks are characterized by pre-training on a large amount of general data followed by fine-tuning on datasets specific to a particular application [11].

For example, one of the best-known DL models, the BERT neural network, which I covered in a separate article, is pre-trained on Wikipedia articles and then applied to text recognition. The XLNet neural network works on a similar principle and is also used in natural language processing for text analysis and generation, data extraction, information retrieval, speech synthesis, machine translation, automatic summarization, annotation, text simplification and other NLP problems [12]. Another deep neural network, CQM (Calibrated Quantum Mesh), also shows excellent results (more than 95%) in natural language understanding, extracting the meaning of a word from a probabilistic analysis of its surroundings (context) without using predefined keywords [13]. Figure 5 shows how a pre-trained model is used as a source of features for a separate downstream DL network during transfer learning in NLP problems [14].

Fig. 5. The scheme of transfer learning in DL networks
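The idea of reusing a pre-trained model as a fixed feature extractor for a downstream model can be sketched in a few lines. In the toy example below, `pretrained_features` is a hypothetical stand-in for a large pre-trained network (it is not BERT or any real model), and only a small logistic "head" is trained on the task-specific data. All names, data and hyperparameters are assumptions made for illustration.

```python
import math


def pretrained_features(x):
    """Stand-in for a frozen pre-trained network (hypothetical):
    its parameters are never updated while the downstream model trains."""
    return [math.tanh(x), math.tanh(x - 1.0)]


def predict(head_w, head_b, x):
    """Downstream model: a logistic head on top of the frozen features."""
    u = sum(w * f for w, f in zip(head_w, pretrained_features(x))) + head_b
    return 1.0 / (1.0 + math.exp(-u))


def fine_tune_head(samples, lr=0.5, epochs=2000):
    """Train only the head weights by gradient descent on the log-loss."""
    head_w, head_b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in samples:
            error = predict(head_w, head_b, x) - label  # gradient of log-loss w.r.t. u
            feats = pretrained_features(x)
            head_w = [w - lr * error * f for w, f in zip(head_w, feats)]
            head_b -= lr * error
    return head_w, head_b


# Hypothetical task-specific dataset: the label is 1 when x > 0.5.
task_data = [(-1.0, 0), (0.0, 0), (0.3, 0), (0.8, 1), (1.5, 1), (2.0, 1)]
head_w, head_b = fine_tune_head(task_data)
print([round(predict(head_w, head_b, x)) for x, _ in task_data])
```

Because the feature extractor is frozen, only three numbers are learned from the small task dataset; in practice the same split lets a large general-purpose model be adapted with little labeled data.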

Among other DL models, capsule networks are worth mentioning: unlike convolutional networks, they process visual images taking into account the spatial hierarchy between simple and complex objects, which increases classification accuracy and reduces the amount of data needed for training [15]. Also of note is deep reinforcement learning (DRL), in which a neural network interacts with an environment through observations, actions, penalties and rewards [16]. DRL is considered the most universal of all ML methods, so it can be used in most business applications. In particular, the AlphaGo neural network is a DRL model: in 2015 it became the first program to defeat a professional human player at the ancient Chinese game of Go, and in 2017 it defeated the strongest professional player in the world [6].
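The interaction loop of observations, actions and rewards can be illustrated with tabular Q-learning, a deliberately simplified relative of DRL: here the value estimates are stored in a table, whereas a deep RL system would replace the table with a neural network. The corridor environment, the reward of 1 at the goal, and all hyperparameters are assumptions chosen for illustration.

```python
import random

N_STATES = 5        # states 0..4; state 4 is the goal
ACTIONS = (-1, +1)  # move left or move right


def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values from experience gathered by random exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            a = rng.randrange(len(ACTIONS))  # purely exploratory behavior
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning target: reward plus discounted best value of the next state.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q_table = q_learning()
policy = ["left" if q[0] > q[1] else "right" for q in q_table[:-1]]
print(policy)
```

The agent is never told that the goal is on the right; the preference emerges from the rewards it observes while acting.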

GAN-like networks are actively used for image recognition, for producing photorealistic images, for improving the quality of visual information and for cybersecurity. A GAN is a combination of two competing neural networks, one of which (G, the generator) generates samples, while the other (D, the discriminator) tries to distinguish correct ("genuine") samples from incorrect ones, processing all the data. Over time each network improves, so the quality of data processing increases significantly, because handling of interference is already built into the training process. This unsupervised DL system was first described by Ian Goodfellow and colleagues in 2014, while the idea of competitive learning was put forward in 2013 by Li, Gauci and Gross. Today GANs are actively used in the video industry and design (for preparing film or animation frames and computer game scenes, and for creating photorealistic images), as well as in the space industry and astronomy (for improving images obtained from astronomical observations) [17]. GANs are thus well suited to a wide range of unsupervised learning tasks where labeled data does not exist or is too expensive to prepare, for example, creating a 3D image of a remote object from fragmentary images of it. Thanks to the competitive approach, this ML method is faster than similar DL models, because two networks with opposite local goals work at once toward a common global result [10].
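The competition between the generator and the discriminator is commonly formalized as the minimax objective from the 2014 GAN paper. The notation is not used elsewhere in this article: p_data denotes the distribution of real samples, and p_z the distribution of the random noise z fed to the generator.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_z(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

The discriminator D is trained to increase this value by scoring real samples high and generated samples low, while the generator G is trained to decrease it by making D(G(z)) large.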

AutoML: Tooling Features

The value of AutoML lies in the "democratization" of machine learning: this technology lets business users, not just experienced Data Scientists, work with powerful analytics and forecasting tools. AutoML tools aim to simplify the creation and application of complex algorithms as much as possible: thanks to simplified user interfaces, they make it possible to quickly build the required models while reducing the likelihood of erroneous calculations. AutoML systems are designed to process large amounts of data, including preliminary preparation of datasets: users can define labels and select the necessary parts of the dataset themselves through the UI. This approach differs significantly from the traditional work of a Data Scientist, because it lets you focus on the business task rather than on data processing issues. Many ML platforms are compatible with Android and iOS, so models can be integrated with mobile applications quickly and seamlessly [18]. Among the best-known AutoML solutions, the following are worth noting:

Microsoft Azure ML is a cloud service that lets users upload their data through a graphical interface for developing code, which itself creates neural networks for data analysis. The program works on any device and requires only minimal knowledge of data or programming from the user. The service includes the Machine Learning Studio and Machine Learning API Service tools, which make it possible to create models that estimate the likelihood of an event using data stored in SQL Server and other platforms, including Azure HDInsight (Microsoft's implementation of Apache Hadoop) [18].
Google Cloud AutoML helps you create and train DL models on your own for image and object recognition as well as for NLP tasks, including machine translation. In particular, AutoML Vision covers recognition of faces, labels, handwriting and logos, content moderation, optical character recognition and other similar image processing tasks. For example, Google Cloud ML classifiers are used at the Harte Research Institute of Texas A&M University-Corpus Christi (TAMUCC) to identify attributes in large datasets of coastline images along the Gulf of Mexico. The AutoML Natural Language toolkit classifies content into specific categories, extracting semantic structures from the text and recognizing its elements. This makes it possible to quickly search, translate, annotate, compress and filter text data, which Rewordify, Hater News, Smmry, Grammarly, Google WebSpeech, Vocalware and many other businesses use successfully. However, most of the features of Google Cloud AutoML are so far fully available only in English. Google also offers the AutoML Data Science Toolkit, a set of tools for working with large volumes of data. Using this tool, the Japanese marketplace Mercari increased the classification efficiency of its branded products by 16%, achieving a product recognition accuracy of 91.3% [19].
Amazon SageMaker offers an easy-to-use service for accelerating the development of ML models and deploying them on the AWS cloud platform. The system includes ready-made algorithms and an integrated Jupyter notebook environment, so you can use general-purpose algorithms and training frameworks as well as create your own using Docker containers. To speed up training, ML models can be replicated across several instances of the cloud cluster [18].
Auto-Keras - unlike the AutoML systems listed above, this open source library, developed at Texas A&M University, provides functions for automatically searching for the architecture and hyperparameters of DL models. In essence, Auto-Keras is a regular Python package that requires an ML library, for example Theano or TensorFlow. Auto-Keras automatically searches for the architecture of a neural network, selecting its structure and hyperparameters for the optimal solution of a given task (classification, regression, etc.), which greatly simplifies ML modeling [20].
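To give a feel for what automatic hyperparameter selection means, here is a toy sketch of an exhaustive search that trains one small model per configuration and keeps the one with the lowest validation error. It does not use the Auto-Keras API or that of any other AutoML product; the model, the grid and the function names are all assumptions made for illustration, and real tools search far larger spaces with smarter strategies.

```python
import itertools


def fit(train, lr, epochs):
    """Train y ≈ w * x + b with plain gradient descent."""
    w = b = 0.0
    n = len(train)
    for _ in range(epochs):
        gw = sum(2 * (w * x + b - y) * x for x, y in train) / n
        gb = sum(2 * (w * x + b - y) for x, y in train) / n
        w, b = w - lr * gw, b - lr * gb
    return w, b


def mse(model, data):
    """Mean squared error of a (w, b) model on a dataset."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def search(train, valid, grid):
    """Try every configuration in the grid; keep the best on validation data."""
    best = None
    for lr, epochs in itertools.product(grid["lr"], grid["epochs"]):
        score = mse(fit(train, lr, epochs), valid)
        if best is None or score < best[0]:
            best = (score, {"lr": lr, "epochs": epochs})
    return best


# Hypothetical data from y = 2x + 1, split into training and validation parts.
train = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
valid = [(0.5, 2.0), (2.5, 6.0)]
grid = {"lr": [0.001, 0.01, 0.1], "epochs": [10, 100]}
best_score, best_config = search(train, valid, grid)
print(best_config, best_score)
```

The user supplies only data and a search space; the choice of learning rate and training length is made by the procedure, which is the part of the workflow that AutoML tools take over.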

Instead of a conclusion

A brief summary of the place of neural networks in modern Data Science:

neural networks are not the only ML method, although today they are perhaps the most popular and most promising one;
despite a history of almost 80 years, the mathematical apparatus of neural networks began to be actively used in practice only in the 21st century, with the advent of fast and relatively cheap computing power able to realize the ideas embodied in it;
there is a clear tendency towards the democratization of ML in general and of neural networks in particular, which are increasingly used not only in scientific research but also in various business tasks and entertainment applications;
the work of Data Scientists is also becoming simpler with the advent of AutoML tools, which automate many stages of modeling, including complex weight tuning, hyperparameter optimization and other labor-intensive procedures.

So, the most obvious modern ML trends are deep neural networks and AutoML tools. They make Data Science more popular, including among non-specialists, which in turn broadens its practical use and drives the further development of all AI methods.

Sources

1. Machine learning
2. Kruglov V.V., Borisov V.V. Artificial neural networks. Theory and practice. 2nd ed., stereotyped. Moscow: Goryachaya Liniya-Telecom, 2002. 382 p.
3. Artificial neural network
4. Artificial neuron
5. Activation function
6. Neural networks: how artificial intelligence helps in business and life
7. Bayesian network
8. Bayesian Conditional Generative Adverserial Networks
9. AutoML
10. Top 10 Trends in Artificial Intelligence (AI) Technologies in 2018
11. Deep learning
12. NLP - Natural Language Processing
13. CQM - Another Look in Deep Learning for Search Engine Optimization in Natural Language
14. The State of Transfer Learning in NLP
15. Dynamic Routing Between Capsules. Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton
16. Reinforcement learning
17. Generative adversarial network
18. AutoML Helps Deal With Shortage of AI Professionals
19. Google AutoML and new machine learning features
20. Libraries for Deep Learning: Keras

Alexei Chernobrovov, Consultant on Analytics and Data Monetization