
Introduction to the design and implementation of MXNet

Note: reprinted with the author's permission.

Original link: https://github.com/dmlc/mxnet/issues/797

Author: @

Academic paper download

A neural network is, in essence, a language: we use it to express our understanding of a problem. For example, we use convolution layers to express spatial correlation and RNNs to express temporal continuity. Depending on the complexity of the problem and on how information flows from input to output, we connect layers of different sizes according to certain principles. In recent years, with the explosion of data and computing power, neural networks have become ever deeper and larger. For example, the winners of recent ImageNet competitions have all used networks with dozens of layers. Networks of this kind are what we usually call deep learning. From an application point of view, how to express a neural network conveniently and how to train the model efficiently are both very important.

For a good deep learning system, or more broadly a good scientific computing system, the most important question is how to design its programming interface. They all embed a domain-specific language (DSL) in a host language. For example, numpy embeds matrix operations into Python. Such embeddings fall roughly into two kinds. One is shallow embedding, where each statement is executed eagerly, one at a time; systems of this kind usually adopt imperative programming, and numpy and Torch belong here. The other is deep embedding, which provides a complete language of its own; systems of this kind usually adopt declarative programming, where the user only declares what to compute and the actual execution is left to the system. These include Caffe, Theano, and the recently released TensorFlow.

Both approaches have their advantages and drawbacks, as summarized below:

| | Shallow embedding, imperative programming | Deep embedding, declarative programming |
|---|---|---|
| How `a = b + 1` executes | Requires that `b` has already been assigned. The addition is performed immediately and the result is stored in `a`. | Returns the corresponding computation graph. We can then assign a value to `b` and execute the addition. |
| Advantages | Semantically easy to understand, flexible, with precise control over execution. Interacts seamlessly with the host language, so it is easy to use the host language's algorithms, toolkits, debuggers, and profilers. | When execution really starts, the entire graph is known, so the system can apply a series of optimizations to improve performance. Auxiliary functionality is also easy to implement, such as computing forward and backward for any graph, visualizing the graph, and saving or loading it. |
| Drawbacks | Hard to implement unified auxiliary functions or whole-program optimization. | Many host-language features are unavailable. Things that are simple in the host language, such as if-else statements, often become cumbersome. Debugging is harder, e.g. monitoring the intermediate result of one node inside a complex computation graph. |
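To make the contrast between the two styles concrete, here is a minimal sketch in Python (illustrative only, not MXNet's API): the imperative version runs the addition immediately, while the declarative version first builds a tiny expression graph and only computes when the free variable is bound.

```python
import numpy as np

# Imperative (shallow embedding): each statement executes immediately.
b = np.ones(3)
a = b + 1                      # the addition runs right here

# Declarative (deep embedding): first build a graph, then bind and run.
class Var:
    """A free variable in the expression graph."""
    def eval(self, env):
        return env[self]

class Const:
    def __init__(self, value):
        self.value = value
    def eval(self, env):
        return self.value

class Add:
    """Node representing lhs + rhs; nothing computed until eval()."""
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs
    def eval(self, env):
        return self.lhs.eval(env) + self.rhs.eval(env)

b_sym = Var()
a_sym = Add(b_sym, Const(1))              # only a graph so far
result = a_sym.eval({b_sym: np.ones(3)})  # bind b, then execute
```

In the declarative version, the system sees the whole `Add` graph before anything runs, which is what makes graph-level optimization possible.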

Most existing systems use only one of these two programming styles. MXNet, in contrast, tries to combine the two seamlessly. In imperative programming, MXNet provides tensor computation; in declarative programming, MXNet supports symbolic expressions. Users can freely mix them to implement their ideas quickly. For example, we can describe a neural network with declarative programming and let the system perform automatic differentiation. On the other hand, the iterative training loop that updates the model may involve a lot of control logic, which is more convenient to implement imperatively. We also use imperative programming for convenient debugging and for exchanging data with the host language.

The following table compares MXNet with other popular deep learning systems:

| | Main language | Binding languages | Hardware | Distributed | Imperative | Declarative |
|---|---|---|---|---|---|---|
| Caffe | C++ | Python/Matlab | CPU/GPU | x | x | v |
| TensorFlow | C++ | Python | CPU/GPU/Mobile | v | x | v |
| MXNet | C++ | Python/R/Julia/Go | CPU/GPU/Mobile | v | v | v |

(Note: at the time of writing, TensorFlow had not yet open-sourced its distributed implementation.)

The MXNet system architecture is shown in the following figure:


The bottom layer is the system implementation shared by both programming modes, including support for the various hardware back ends; above it sit the embedding-independent programming interfaces: matrix computation, symbolic expressions, and distributed communication. The next section introduces the programming interfaces, and the section after that describes the system implementation. After that we present some experimental results and discuss future directions.

Programming interfaces

Symbol: declarative symbolic expressions

MXNet uses multi-valued output symbolic expressions to declare computation graphs. Symbols are composed of operators. An operator can be a simple matrix operation such as "+", or something as complex as a neural network layer, e.g. a convolution layer. An operator can take multiple input variables, produce multiple output variables, and hold internal state variables. A variable can be free, in which case we assign a value to it later, or it can be the output of another operator. For example, the following Julia code defines a multilayer perceptron, which consists of a free variable representing the input data followed by a chain of neural network layers.

```julia
using MXNet

mlp = @mx.chain mx.Variable(:data)               =>
                mx.FullyConnected(num_hidden=64) =>
                mx.Activation(act_type=:relu)    =>
                mx.FullyConnected(num_hidden=10) =>
                mx.Softmax()
```

Before executing a symbolic expression, we need to bind all of its free variables. In the example above, we need to supply the data, as well as the implicitly defined inputs of each layer, such as each fully connected layer's weight and bias. We also declare which outputs we need, for example the softmax output.

In addition to executing the network to obtain the softmax output (usually called the forward pass), a symbolic expression also supports automatic differentiation to obtain the corresponding gradients. Moreover, before actually computing, we can estimate the memory an expression will need, visualize it, and save and load it.
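The automatic differentiation mentioned here can be sketched as a reverse-mode pass over the graph. The toy below, assuming scalar values and only `+` and `*` operators, illustrates the idea; it is not MXNet's implementation.

```python
# Minimal reverse-mode automatic differentiation on a tiny expression
# graph -- a sketch of what "backward" does, not MXNet code.

class Node:
    def __init__(self, value, parents=()):
        self.value = value      # result of the forward pass
        self.parents = parents  # list of (parent_node, local_gradient)
        self.grad = 0.0

def add(x, y):
    # d(x+y)/dx = 1, d(x+y)/dy = 1
    return Node(x.value + y.value, [(x, 1.0), (y, 1.0)])

def mul(x, y):
    # d(x*y)/dx = y, d(x*y)/dy = x
    return Node(x.value * y.value, [(x, y.value), (y, x.value)])

def backward(output):
    """Propagate gradients from the output back to every input."""
    # topological order ensures a node's grad is complete before use
    order, seen = [], set()
    def visit(n):
        if n not in seen:
            seen.add(n)
            for p, _ in n.parents:
                visit(p)
            order.append(n)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local

# y = x*x + x  =>  dy/dx = 2x + 1
x = Node(3.0)
y = add(mul(x, x), x)
backward(y)
```

With `x = 3`, the forward pass gives `y = 12` and the backward pass accumulates `dy/dx = 2*3 + 1 = 7` into `x.grad`.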

NDArray: imperative tensor computation

MXNet provides NDArray, an imperative tensor computation interface, to bridge the host language and the symbolic expressions. In the code below, we multiply a matrix by a constant on the GPU and then print the result via numpy:

```python
>>> import mxnet as mx
>>> a = mx.nd.ones((2, 3), mx.gpu())
>>> print (a * 2).asnumpy()
[[ 2.  2.  2.]
 [ 2.  2.  2.]]
```

On the other hand, NDArray interacts seamlessly with symbolic execution. Suppose we have defined a neural network using Symbol; then we can implement gradient descent as follows:

```c++
for (int i = 0; i < max_iter; ++i) {
  network.forward();
  network.backward();
  network.weight -= eta * network.gradient;
}
```

Here the gradients are computed by the Symbol. The Symbol's outputs are represented as NDArrays, so we can update the weights using the operations NDArray provides. In addition, we use the host language's for loop to drive the iterations, and the learning rate eta can be modified directly in the host language.

This hybrid implementation performs just as well as one written with pure symbolic expressions, which would be far more complex when expressing the control logic. The reason is that executing an NDArray operation also just builds a computation graph, in the same way Symbol does, and it is executed together with the other operations by the back end. For an operation such as -= we do not need the result immediately, because we only pass it on to the forward pass of the next iteration. So by the time the for loop above finishes, we have merely submitted a computation graph made up of several Symbol and NDArray operations to the back-end engine. The program only blocks when we eventually need a result, for example when copying the weights back into the host language or saving them to disk.
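The deferred-execution behaviour described above can be sketched in a few lines: operations are queued rather than run, and the program only blocks when a result is actually requested. This is an illustrative single-threaded toy; MXNet's real engine runs tasks on worker threads.

```python
# Sketch of deferred execution: pushed operations are queued, and the
# caller only blocks when a value is actually needed (like .asnumpy()).

class LazyArray:
    def __init__(self, engine, compute):
        self.engine = engine
        self.compute = compute   # thunk producing the actual value
        self.value = None
        self.done = False

    def wait(self):
        """Block until this array's value is ready."""
        self.engine.run_until(self)
        return self.value

class Engine:
    def __init__(self):
        self.queue = []          # pending tasks, in submission order

    def push(self, compute):
        arr = LazyArray(self, compute)
        self.queue.append(arr)   # nothing is computed here
        return arr

    def run_until(self, target):
        # execute queued tasks in order until the target is computed
        while not target.done:
            task = self.queue.pop(0)
            task.value = task.compute()
            task.done = True

engine = Engine()
a = engine.push(lambda: 1 + 1)           # queued, not computed
b = engine.push(lambda: a.value * 10)    # depends on a; also queued
result = b.wait()                        # now both tasks actually run
```

Submitting `a` and `b` costs almost nothing; the work happens inside `wait()`, mirroring how the for loop above finishes instantly while the engine executes the graph in the background.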

KVStore: data exchange between devices

MXNet provides a distributed key-value store for data exchange. It has two main operations:

1. push: push a key-value pair from a device into the store.

2. pull: pull the value of a key from the store. KVStore also accepts a user-defined update function that controls how a received value is written into the store. Finally, KVStore offers a range of data consistency models, from eventual consistency to sequential consistency.

In the example below, we turn the earlier gradient descent algorithm into distributed gradient descent:

```c++
KVStore kvstore("dist_async");
kvstore.set_updater([](NDArray weight, NDArray gradient) {
    weight -= eta * gradient;
  });
for (int i = 0; i < max_iter; ++i) {
  kvstore.pull(network.weight);
  network.forward();
  network.backward();
  kvstore.push(network.gradient);
}
```

Here we first create a KVStore that uses the eventual consistency model and register the update function. In each iteration, every compute node first pulls the latest weights back, then pushes out the gradients it computes. The KVStore uses the registered update function to apply the received gradients to the weights it stores.

push and pull use the same lazy technique as NDArray: they merely submit the corresponding operations to the back-end engine, which schedules the actual data movement. As a result, the implementation above differs little in performance from one written in pure symbolic form.
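The push/pull/updater semantics in the C++ snippet above can be mirrored by a toy single-process key-value store. The class and names below are illustrative, not MXNet's actual API.

```python
# Toy single-process KVStore mirroring the push/pull/updater semantics
# described above.

class ToyKVStore:
    def __init__(self, updater):
        self.store = {}          # key -> current value (e.g. a weight)
        self.updater = updater   # how a pushed value merges into the store

    def init(self, key, value):
        self.store[key] = value

    def push(self, key, value):
        # merge the pushed value (e.g. a gradient) via the update function
        self.store[key] = self.updater(self.store[key], value)

    def pull(self, key):
        return self.store[key]

eta = 0.1
kv = ToyKVStore(updater=lambda weight, grad: weight - eta * grad)
kv.init("w", 1.0)
kv.push("w", 2.0)    # push gradient 2.0 -> w becomes 1.0 - 0.1 * 2.0
w = kv.pull("w")
```

The point of the design is that workers never touch each other's memory directly: they only push gradients and pull weights, while the store decides how updates are applied.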

Data reading module

Data reading plays an important role in overall system performance. MXNet provides tools that can pack samples of arbitrary size into one or more files, to speed up both sequential and random reads.

Data typically lives on a local disk or a remote file system (for example HDFS or Amazon S3), and in each iteration we only need to read the currently required batch into memory. MXNet provides iterators that can read files in different formats. An iterator uses multiple threads to decode the data and to hide the overhead of reading from the file system.
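The idea of hiding read latency behind the computation can be sketched with a small prefetching iterator: a background thread reads and decodes batches into a bounded buffer while the main thread consumes them. This is an illustrative sketch, not MXNet's io module.

```python
# Sketch of a prefetching data iterator: a background thread fills a
# bounded buffer so file reading overlaps with computation.

import queue
import threading

class PrefetchIter:
    def __init__(self, read_batch, num_batches, depth=2):
        self.buf = queue.Queue(maxsize=depth)
        def worker():
            for i in range(num_batches):
                # read + decode happen off the main thread
                self.buf.put(read_batch(i))
            self.buf.put(None)       # sentinel: no more data
        threading.Thread(target=worker, daemon=True).start()

    def __iter__(self):
        while True:
            batch = self.buf.get()   # blocks only if prefetch fell behind
            if batch is None:
                return
            yield batch

# read_batch simulates loading batch i from disk
batches = list(PrefetchIter(lambda i: [i] * 4, num_batches=3))
```

The bounded `depth` keeps memory usage constant: the reader can run at most two batches ahead of the consumer.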

Training module

MXNet implements the common optimization algorithms used to train models. The user only needs to provide a data iterator and the Symbol of the neural network. Optionally, the user can supply a KVStore to perform distributed training. For example, the code below trains a model using distributed asynchronous SGD, with each machine using two GPUs.

```python
import mxnet as mx

model = mx.model.FeedForward(
    ctx                = [mx.gpu(0), mx.gpu(1)],
    symbol             = network,
    num_epoch          = 100,
    learning_rate      = 0.01,
    momentum           = 0.9,
    wd                 = 0.00001,
    initializer        = mx.init.Xavier(factor_type="in", magnitude=2.34))

model.fit(
    X                  = train_iter,
    eval_data          = val_iter,
    kvstore            = mx.kvstore.create('dist_async'),
    epoch_end_callback = mx.callback.do_checkpoint('model_'))
```

System implementation


Computation graph

A symbolic expression whose free variables have been bound can be represented as a computation graph. Below is a partial computation graph of the multilayer perceptron defined earlier, including both the forward and the backward pass.


Circles represent variables, boxes represent operators, and arrows represent data dependencies. Before execution, MXNet optimizes the graph and allocates space for all variables ahead of time.

Graph optimization

Computation-graph optimization has been studied for many years in fields such as databases and compilers. So far we have only explored a few simple techniques.

1. We declare in advance which output variables are needed, so we only have to compute the operations those outputs actually depend on. For example, during prediction we do not need gradients, so the entire backward graph can be ignored. For feature extraction, we may only need the outputs of some intermediate layers, so the computations after them can be skipped.

2. We can merge certain operations. For example, a*b+1 needs only a single BLAS or CUDA kernel, rather than being represented as two separate operations.

3. We implement some "large" operations, such as a convolution layer that corresponds to just one operator. This greatly reduces the size of the graph and makes it easier to hand-optimize individual operations.
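Optimization 1 above, computing only what the requested outputs need, amounts to a reachability pass over the graph. A minimal sketch, with the graph encoded as a plain dependency dict (node names are illustrative):

```python
# Sketch of dead-node pruning: given the requested outputs, keep only
# the nodes they transitively depend on.

def prune(graph, outputs):
    """graph: node -> list of input nodes. Returns the needed node set."""
    needed = set()
    stack = list(outputs)
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(graph.get(node, []))   # walk dependencies
    return needed

# a forward pass plus one of its backward (gradient) nodes
graph = {
    "fc1":      ["data"],
    "softmax":  ["fc1"],
    "fc1_grad": ["softmax", "fc1"],
}
# prediction only asks for the forward output, so the gradient node
# is never reached and the backward graph disappears
kept = prune(graph, ["softmax"])
```

Requesting `"softmax"` keeps only `data -> fc1 -> softmax`; requesting `"fc1_grad"` instead would keep the whole graph.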


Memory allocation

Memory is usually an important bottleneck, especially for GPUs and smart devices. Neural networks need a large amount of temporary space during computation, for example to hold the inputs and outputs of each layer. Allocating separate space for every variable would incur a large memory overhead. Fortunately, we can infer the lifetime of every variable from the computation graph, that is, the period from its creation to its last use, and then let two variables with non-overlapping lifetimes reuse the same memory. This problem has been studied in many fields, for example register allocation in compilers. However, the optimal allocation algorithm requires O(n^2) time, where n is the number of variables.

MXNet instead provides two heuristic strategies, each with linear time complexity.

1. inplace. In this strategy we simulate the execution of the graph, maintaining for each variable a count of how many operations still need it. When a variable's count drops to zero, we recycle its memory.

2. co-share. We allow two variables to use the same memory space, which of course is only safe if the two are never written at the same time. So we only consider co-sharing between variables that cannot run in parallel. Each time, we pick a path in the graph: all variables along it are linked by dependencies and therefore cannot run in parallel, so we allocate memory for them together, remove them from consideration, and repeat.
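The inplace strategy can be sketched as a dry run over the execution order with a free list of memory slots: last uses free a slot, and later outputs reuse it. This is a toy model over abstract "slots"; MXNet plans actual device memory this way.

```python
# Sketch of the "inplace" strategy: simulate the graph in execution
# order, keep a use count per variable, and recycle a variable's slot
# once its count hits zero.

def plan_memory(steps):
    """steps: list of (output_var, input_vars) in execution order.
    Returns (var -> slot id, total slots used)."""
    # how many steps still consume each variable
    uses = {}
    for _, inputs in steps:
        for v in inputs:
            uses[v] = uses.get(v, 0) + 1

    assignment, free, next_slot = {}, [], 0

    # graph inputs (never produced by a step) get slots up front
    produced = {out for out, _ in steps}
    for _, inputs in steps:
        for v in inputs:
            if v not in produced and v not in assignment:
                assignment[v] = next_slot
                next_slot += 1

    for out, inputs in steps:
        if free:
            assignment[out] = free.pop()   # reuse a recycled slot
        else:
            assignment[out] = next_slot    # allocate a fresh slot
            next_slot += 1
        for v in inputs:
            uses[v] -= 1
            if uses[v] == 0:               # last consumer: recycle
                free.append(assignment[v])
    return assignment, next_slot

# a chain a -> b -> c -> d: each layer needs its input exactly once
steps = [("b", ["a"]), ("c", ["b"]), ("d", ["c"])]
assignment, slots_used = plan_memory(steps)
```

For this four-variable chain only two slots are needed: `c` reuses `a`'s memory and `d` reuses `b`'s, which is the kind of saving the memory-usage experiments below measure.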


Engine

In MXNet, every task, including tensor computation, symbol execution, and data communication, is executed by the back-end engine. First, every resource unit, such as an NDArray, a random number generator, or a piece of temporary space, registers a unique tag with the engine. Each task submitted to the engine then declares the tags of the resources it needs. The engine keeps track of each resource, and once all the resources a task needs are in place, for example once the previous task producing a resource has finished, the engine schedules the task for execution.

An MXNet instance usually uses multiple hardware resources, including CPUs, GPUs, PCIe channels, network, and disk, so the engine schedules with multiple threads: any tasks whose resource dependencies do not conflict may execute in parallel, maximizing resource utilization.

Unlike a typical dataflow engine, the MXNet engine allows a task to modify existing resources. To keep scheduling correct, a task must declare separately which resources it only reads and which it modifies. This extra write dependency brings several conveniences. For example, we can easily support the in-place array modifications that are common in numpy and other tensor libraries, and it also makes memory reuse easier. Moreover, if two operations generate random numbers from the same seed, we can mark both as modifying the seed, so that the engine will not run them in parallel and the results stay reproducible.
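The scheduling rule implied by these read/write declarations is simple: two tasks may run in parallel only if neither writes a resource the other touches. A minimal sketch of that conflict check (task tuples and tags are illustrative):

```python
# Sketch of read/write dependency checking: two tasks conflict (and
# must be serialized) if either one writes a resource tag the other
# reads or writes.

def conflicts(task_a, task_b):
    reads_a, writes_a = task_a
    reads_b, writes_b = task_b
    return bool(writes_a & (reads_b | writes_b) or
                writes_b & (reads_a | writes_a))

# tasks as (read tags, write tags)
t_read1  = ({"w"}, set())     # e.g. a forward pass reading weight w
t_read2  = ({"w"}, set())     # another reader of w
t_update = ({"g"}, {"w"})     # w -= eta * g mutates w
t_rand1  = (set(), {"seed"})  # both random ops mutate the seed, so
t_rand2  = (set(), {"seed"})  # the engine must serialize them

readers_parallel = not conflicts(t_read1, t_read2)  # two readers: OK
update_serial    = conflicts(t_read1, t_update)     # read vs write on w
rng_serial       = conflicts(t_rand1, t_rand2)      # write vs write
```

Two readers of `w` are free to run concurrently, while the weight update and the two seed-mutating random operations are forced into sequence, which is exactly the seed example in the text.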

Data communication

KVStore is implemented on top of a parameter server, but it differs from previous parameter-server work in two important ways:

1. We manage data consistency through the engine. This makes the implementation of the parameter server itself quite simple, and lets KVStore operations combine seamlessly with all other operations.

2. We use a two-level communication structure, as shown in the figure below. A first-level server manages communication between the devices inside a single machine, while a second-level server manages communication between machines. A first-level server can aggregate the local data before talking to the second level, which reduces network bandwidth consumption. Moreover, since intra-machine and inter-machine communication differ greatly in bandwidth and latency, the two levels can use different consistency models: for example, the first level can use a strong consistency model while the second level uses a weak consistency model to reduce synchronization overhead.
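A back-of-the-envelope sketch shows why the first-level servers help: with k GPUs per machine, reducing gradients locally first means each machine sends one gradient over the network instead of k. The numbers below are illustrative, not measurements.

```python
# Why level-1 aggregation saves bandwidth: with local reduction, a
# machine sends one summed gradient instead of one per GPU.

def network_traffic(num_gpus_per_machine, gradient_mb, local_reduce):
    """MB each machine sends over the network per iteration."""
    if local_reduce:
        return gradient_mb            # level-1 server sums first
    return num_gpus_per_machine * gradient_mb

naive  = network_traffic(4, 100, local_reduce=False)  # every GPU sends
staged = network_traffic(4, 100, local_reduce=True)   # machine sends once
```

With 4 GPUs and a 100 MB gradient, local reduction cuts per-machine network traffic from 400 MB to 100 MB per iteration, a 4x saving that grows with the number of devices per machine.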



Portability

Being lightweight and portable is an important goal of MXNet. The MXNet core is written in C++ and exposes C header files. This makes the system easy to port, and also easy to call from any language that can interface with C. In addition, we provide a script that packs the entire MXNet core into a single C++ source file, which makes it easy to compile and use on restricted platforms such as smart devices.

Experimental results

Here we provide some early experimental results.

Comparison with other systems

We first compare the performance of MXNet, Torch7, Caffe, and TensorFlow on popular convolutional networks that won recent ImageNet competitions. Every system uses the same CUDA 7.0 and cuDNN 3, except TensorFlow, which only supports CUDA 6.5 and cuDNN 2. We use a single GTX 980 and report the time for one forward and one backward pass.


As we can see, MXNet, Torch, and Caffe perform similarly. This is expected, since on a single card most of the time is spent inside CUDA and cuDNN kernels. TensorFlow is more than twice as slow as the other three, possibly because of its older cuDNN version and because its open-source release is still new.

Memory usage

Next we investigate the effect of the different memory allocation strategies on memory usage. The figure below shows the memory overhead of the internal variables (excluding the model parameters, the initial input, and the final output) for both prediction and training, with a batch size of 128.


As we can see, both inplace and co-share greatly reduce memory usage, and combining the two cuts memory usage by a factor of two for training and by a factor of four for prediction. In particular, even for the most complex network, VGGNet, MXNet needs only 16 MB of extra memory to predict a single image.


Finally, we report performance for distributed training. We use the ImageNet-1k dataset (about 1.2 million 224x224x3 images in 1,000 classes) and train GoogLeNet with batch normalization added. We use Amazon EC2 g2.8xlarge instances, both a single machine and a cluster; the figure below shows the convergence of a single machine versus ten g2.8xlarge machines.


Judged by training accuracy, the single machine converges faster per epoch, because in the distributed setting the effective batch size is larger than on a single machine. Interestingly, though, the test accuracies of the two are very similar.

Each pass over the data takes about 14,000 seconds on a single machine, but only about 1,400 seconds on ten machines. If we compare test accuracy against wall-clock time, the ten machines deliver roughly a tenfold speedup.

Past, present, and future

In the second half of last year, the developers of several excellent C++ machine learning projects came together and founded DMLC, originally to make it easier to share code across projects and to give users a consistent experience. At the time we had two deep learning projects. One was cxxnet, which uses configuration files to define and train neural networks. The other was Minerva, which provides a numpy-like computation interface. The former is convenient for convolutional networks; the latter is more flexible. We wanted a system that combined the strengths of both, and that is how MXNet came about. Its name comes from Minerva's M and cxxnet's XNet. The idea of Symbol comes from cxxnet, and the idea of NDArray comes from Minerva. We also sometimes pronounce MXNet as "mix net".

MXNet is the first project that combines the efforts of all DMLC members, and it has also attracted many new core contributors. Our goal is an interesting system: one that is convenient to use, lightweight, and lets people try out new ideas quickly and deploy at scale. For the future, we are focusing on four directions:

  • More hardware support. We are currently considering AMD GPUs, Qualcomm GPUs, Intel Phi, FPGAs, and more smart devices. We believe MXNet's light weight and memory efficiency will pay off on these platforms.
  • More complete operators. At present the operator sets of both Symbol and NDArray are still limited; we want users to be able to extend them quickly and conveniently.
  • More programming languages. In addition to C++, MXNet already supports Python, R, and Julia. We hope to support many more languages, such as JavaScript.

  • More applications. We have spent a lot of effort on image classification, and we will now consider many more applications. For example, last week we tried out the newly published neural art algorithm, which renders one picture in the artistic style of another. The image below applies the style of van Gogh's The Starry Night to the view from my office window.



Next we hope to support more applications, such as speech, machine translation, and question answering.

We hope MXNet makes it easier for everyone to study and apply deep learning, and we hope to learn and progress together with more developers.
