Both forward propagation and backpropagation boil down to a handful of operations on matrices, the most common of which is matrix multiplication. To perform matrix multiplication in reasonable time, you will need to optimize your algorithms.
On macOS there is a simple way to do this: the Accelerate framework. It is actually an umbrella framework for vector-optimized operations:
- vecLib.framework – Contains vector-optimized interfaces for performing math, big-number, and DSP calculations, among others.
- vImage.framework – Contains vector-optimized interfaces for manipulating image data.
The cblas_sgemm function can help you reach really high performance.
In fact, vecLib is essentially Apple's build of two well-known libraries: BLAS and LAPACK.
cblas.h and vblas.h are the interfaces to Apple's implementations of BLAS. Additional documentation on the BLAS standard, including reference implementations, can be found on the web starting from the BLAS FAQ page at these URLs: http://www.netlib.org/blas/faq.html and http://www.netlib.org/blas/blast-forum/blast-forum.html.
clapack.h is the interface to Apple’s implementation of LAPACK. Documentation of the LAPACK interfaces, including reference implementations, can be found on the web starting from the LAPACK FAQ page at this URL: http://netlib.org/lapack/faq.html
This is a good way to combine your code with a C/C++ library on both Linux and macOS.
One more way is to use the CPU's SIMD (Single Instruction, Multiple Data) instructions, such as the SSE family.
Modern CPUs process data in parallel chunks. A SIMD instruction operates on all loaded data in a single operation. In other words, if the SIMD unit loads eight data points at once, an add operation applied to the data happens to all eight values at the same time. This parallelism is separate from the parallelism provided by a superscalar processor: the eight values are processed in parallel even on a non-superscalar processor, and a superscalar processor may additionally be able to perform multiple SIMD operations in parallel.
More about that: CPU memory and Streaming SIMD Extensions – Matrix Multiplication.
```c
#include "matrix.h"
#include <smmintrin.h> // SSE4.1 intrinsics (_mm_dp_ps)

// S = A * B for square matrices.
// Assumes size is a multiple of 8 and that S is zero-initialized by the caller.
void matrix_multiplication(float** A, float** B, float** S, int size) {
    // 0xF1: multiply all four lanes, store the sum of products in lane 0.
    const int mask = 0xF1;
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            for (int k = 0; k < size; k += 8) {
                // Eight elements of row i of A...
                __m128 v1 = _mm_set_ps(A[i][k + 3], A[i][k + 2], A[i][k + 1], A[i][k]);
                __m128 v2 = _mm_set_ps(A[i][k + 7], A[i][k + 6], A[i][k + 5], A[i][k + 4]);
                // ...paired with eight elements of column j of B.
                // (Transposing B up front would make these loads contiguous and faster.)
                __m128 v3 = _mm_set_ps(B[k + 3][j], B[k + 2][j], B[k + 1][j], B[k][j]);
                __m128 v4 = _mm_set_ps(B[k + 7][j], B[k + 6][j], B[k + 5][j], B[k + 4][j]);

                __m128 s = _mm_dp_ps(v1, v3, mask);
                S[i][j] += _mm_cvtss_f32(s);
                s = _mm_dp_ps(v2, v4, mask);
                S[i][j] += _mm_cvtss_f32(s);
            }
        }
    }
}
```
And a simple test in Swift that uses that code:
```swift
import Foundation

func newMatrix<T>(rows: Int, columns: Int, repeated: T) -> UnsafeMutablePointer<UnsafeMutablePointer<T>?> {
    let matrix = UnsafeMutablePointer<UnsafeMutablePointer<T>?>.allocate(capacity: rows)
    let vectorBuf = UnsafeMutablePointer<T>.allocate(capacity: columns)
    for index in 0..<columns {
        vectorBuf.advanced(by: index).pointee = repeated
    }
    for index in 0..<rows {
        matrix.advanced(by: index).pointee = UnsafeMutablePointer<T>.allocate(capacity: columns)
        matrix.advanced(by: index).pointee?.assign(from: vectorBuf, count: columns)
    }
    vectorBuf.deallocate()
    return matrix
}

let size = 32 // must be a multiple of 8 for the SSE kernel above

let startTime = Date().timeIntervalSinceReferenceDate

let matrixA = newMatrix(rows: size, columns: size, repeated: Float(0.001))
let matrixB = newMatrix(rows: size, columns: size, repeated: Float(0.002))
let matrixC = newMatrix(rows: size, columns: size, repeated: Float(0.0))

matrix_multiplication(matrixA, matrixB, matrixC, Int32(size))

let endTime = Date().timeIntervalSinceReferenceDate
print("Spent: \(endTime - startTime)")

for r in 0..<size {
    for c in 0..<size {
        print(matrixA[r]?[c] as Any)
        print(matrixB[r]?[c] as Any)
        print(matrixC[r]?[c] as Any)
    }
}
```
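To build the Swift test together with the C kernel via swift build, a Swift Package Manager layout along these lines works. This manifest is only a sketch: the target names (CMatMul, MatMul) are hypothetical, and it assumes the C sources live in Sources/CMatMul with their headers in Sources/CMatMul/include.

```swift
// swift-tools-version:5.5
// Package.swift — hypothetical layout:
//   Sources/CMatMul/         the C kernel (matrix.c, include/matrix.h)
//   Sources/MatMul/          the Swift test code shown above
import PackageDescription

let package = Package(
    name: "MatMul",
    targets: [
        // C target; SwiftPM generates a module map from the headers in include/.
        .target(name: "CMatMul"),
        // Swift executable that does `import CMatMul` to call matrix_multiplication.
        .executableTarget(name: "MatMul", dependencies: ["CMatMul"])
    ]
)
```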
Don’t forget to compile the code with the right optimization level and with SSE instructions enabled. For example:
```shell
swift build -Xcc -O3 -Xcc -msse4.1
```
As a result, you have two cross-platform, high-performance approaches.
One more way is GPU computing. I will write about it in the next post.
Author: Volodymyr Pavliukevych
Senior Software Engineer, Data Scientist.