Forward propagation, as well as backpropagation, boils down to a number of matrix operations. The most common one is matrix multiplication. To perform matrix multiplication in reasonable time, you will need to optimize your algorithms.

There is a simple way to do it on macOS by means of the Accelerate framework. It is actually an umbrella framework for vector-optimized operations:

  • vecLib.framework – Contains vector-optimized interfaces for performing math, big-number, and DSP calculations, among others.
  • vImage.framework – Contains vector-optimized interfaces for manipulating image data.

The cblas_sgemm function can help you reach really high performance.
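
For example, a minimal sketch of calling it from Swift through Accelerate might look like this (the sizes and values are purely illustrative, and the exact integer parameter types can vary slightly between SDK versions):

    import Accelerate

    // C = alpha * A * B + beta * C for single-precision, row-major matrices.
    let m = 2, n = 2, k = 2                     // A is m x k, B is k x n, C is m x n
    let a: [Float] = [1, 2,
                      3, 4]
    let b: [Float] = [5, 6,
                      7, 8]
    var c = [Float](repeating: 0, count: m * n)

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                Int32(m), Int32(n), Int32(k),
                1.0,                            // alpha
                a, Int32(k),                    // A and its leading dimension
                b, Int32(n),                    // B and its leading dimension
                0.0,                            // beta
                &c, Int32(n))                   // C and its leading dimension

    print(c)                                    // [19.0, 22.0, 43.0, 50.0]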

Actually, vecLib is essentially a port of two well-known libraries: BLAS and LAPACK.

cblas.h and vblas.h are the interfaces to Apple’s implementation of BLAS; you can find reference documentation in Apple’s BLAS reference. Additional documentation on the BLAS standard, including reference implementations, is available on the web starting from the BLAS FAQ page at these URLs: http://www.netlib.org/blas/faq.html and http://www.netlib.org/blas/blast-forum/blast-forum.html.

clapack.h is the interface to Apple’s implementation of LAPACK. Documentation of the LAPACK interfaces, including reference implementations, can be found on the web starting from the LAPACK FAQ page at this URL: http://netlib.org/lapack/faq.html

This is also a good way to combine your code with a C++ library on both the Linux and macOS platforms.

One more way is to use the CPU’s SSE (Streaming SIMD Extensions) instructions, a form of SIMD (Single Instruction, Multiple Data) processing.

Every modern CPU can process data in parallel chunks, and a SIMD instruction applies one operation to all of the loaded data at once. In other words, if the SIMD system works by loading up eight data points at once, the add operation being applied to the data will happen to all eight values at the same time. This parallelism is separate from the parallelism provided by a superscalar processor; the eight values are processed in parallel even on a non-superscalar processor, and a superscalar processor may be able to perform multiple SIMD operations in parallel.

You can read more about this in CPU memory and Streaming SIMD Extensions – Matrix Multiplication.
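
Rather than writing raw SSE intrinsics, here is a rough sketch of the same idea using Swift’s built-in SIMD vector types (the type alias and function name are just illustrative):

    // A 4x4 single-precision matrix multiply built on SIMD4<Float>.
    // Each matrix is stored as four rows; every operation below processes
    // a whole row (four floats) in a single SIMD operation.
    typealias Matrix4x4 = [SIMD4<Float>]        // exactly four rows

    func simdMultiply(_ a: Matrix4x4, _ b: Matrix4x4) -> Matrix4x4 {
        var result = [SIMD4<Float>](repeating: SIMD4<Float>(repeating: 0), count: 4)
        for i in 0..<4 {
            // row_i(C) = A[i][0] * row_0(B) + A[i][1] * row_1(B) + ...
            var row = a[i].x * b[0]
            row += a[i].y * b[1]
            row += a[i].z * b[2]
            row += a[i].w * b[3]
            result[i] = row
        }
        return result
    }

With optimizations enabled, these SIMD4<Float> operations compile down to vector instructions (SSE on Intel CPUs, NEON on ARM).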

And a simple test in Swift can exercise that code.
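
A minimal version, assuming the simdMultiply sketch above (the names and the loop count are illustrative), might check correctness and time a tight loop:

    import Foundation

    // A correctness check plus a very rough timing loop for simdMultiply.
    let identity: Matrix4x4 = [SIMD4<Float>(1, 0, 0, 0),
                               SIMD4<Float>(0, 1, 0, 0),
                               SIMD4<Float>(0, 0, 1, 0),
                               SIMD4<Float>(0, 0, 0, 1)]
    let matrix: Matrix4x4 = [SIMD4<Float>( 1,  2,  3,  4),
                             SIMD4<Float>( 5,  6,  7,  8),
                             SIMD4<Float>( 9, 10, 11, 12),
                             SIMD4<Float>(13, 14, 15, 16)]

    // Multiplying by the identity must return the matrix unchanged.
    assert(simdMultiply(identity, matrix) == matrix)

    // Accumulate a checksum so the optimizer cannot drop the loop body.
    let start = Date()
    var checksum: Float = 0
    for _ in 0..<1_000_000 {
        checksum += simdMultiply(identity, matrix)[0].x
    }
    print("checksum: \(checksum), elapsed: \(Date().timeIntervalSince(start)) s")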

Don’t forget to compile that code with the right optimization level and with SSE instructions enabled.
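
As a rough guide (file names here are placeholders): the Swift sketches above only need the optimizer, while a C or C++ kernel takes an explicit SSE level, for example:

    # Swift: the SIMD types lower to vector instructions once optimization is on
    swiftc -O matmul.swift -o matmul

    # C/C++: enable optimizations and a specific SSE level explicitly
    clang -O3 -msse4.2 matmul.c -o matmul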

As a result, you have two cross-platform, high-performance algorithms.

One more way is GPU computing. I will write about it in the next post.

Author: Volodymyr Pavliukevych

Senior Software Engineer, Data Scientist.
