In a previous post I introduced the MNIST dataset and the problem of classifying
handwritten digits. In this post I’ll use the code I wrote there to port a
simple neural network implementation to Rust. My goal is to explore performance
and ergonomics for data science workflows in Rust.
The Python Implementation
Chapter 1 of the book describes a very simple neural network with a single
hidden layer that can classify handwritten digits from the MNIST dataset using a
learning algorithm based on stochastic gradient descent. This sounds
complicated, and it kind of is (this stuff was state-of-the-art in the mid
1980s), but really it all comes down to about 150 lines of heavily commented
Python code.
I’m going to assume that you already know the content of that chapter, so stop
here and go read it if you want to brush up on neural network basics. Or don’t,
and just pay attention to the code: you don’t need to understand exactly why the
code works the way it does to see the differences between the Python approach
and the Rust approach.
The fundamental data container in this code is a Network class that represents
a neural network with a user-controllable number of layers and number of neurons
per layer. Each layer of the network after the input layer is represented by a
2D array of weights and a column vector of biases, stored in attributes of the
Network class named weights and biases. These are both lists of 2D NumPy arrays:
the biases are column vectors, but they are still stored as 2D arrays by making
use of a dummy dimension.
The initializer for the Network class looks like this:
import numpy as np

class Network(object):

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network.  For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron.  The biases and weights for the
        network are initialized randomly, using a Gaussian
        distribution with mean 0, and variance 1.  Note that the first
        layer is assumed to be an input layer, and by convention we
        won't set any biases for those neurons, since biases are only
        ever used in computing the outputs from later layers."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
In this simple implementation the weights and biases are initialized by drawing
from the standard normal distribution — a normal distribution with a mean of
zero, standard deviation of 1. We can also see how the biases are explicitly
initialized as column vectors.
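To make the shapes concrete, here is a minimal sketch in plain Python (no NumPy needed) that works out the shape each bias and weight array ends up with for a given sizes list. The helper name layer_shapes is hypothetical and is not part of the book’s code; it only mirrors the two list comprehensions in the initializer.

```python
def layer_shapes(sizes):
    """Return a (bias_shape, weight_shape) pair for every non-input layer.

    Mirrors the initializer above: a layer with y neurons fed by a layer
    with x neurons gets a (y, 1) bias column and a (y, x) weight matrix.
    """
    return [((y, 1), (y, x)) for x, y in zip(sizes[:-1], sizes[1:])]


# A 784-30-10 network is used here purely as an illustration.
for bias_shape, weight_shape in layer_shapes([784, 30, 10]):
    print("biases:", bias_shape, "weights:", weight_shape)
```

Note that the input layer contributes no entry at all, which matches the docstring’s remark that no biases are set for the input neurons.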
The Network class exposes two methods that users would call directly. First,
the evaluate method asks the network to try to identify the digits in a set of
test images and then scores the result against the known correct answers.
Second, the SGD method runs a stochastic gradient descent learning procedure: it
breaks the full set of training images into small mini-batches, updates the
network’s state based on each mini-batch and a user-specifiable learning rate,
eta, and then repeats the procedure with a new randomly selected set of
mini-batches for a user-specifiable number of epochs. The core of the algorithm,
where the state of the neural network gets updated for each mini-batch, looks
like this:
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
For each training image in the mini-batch, we accumulate estimates for the
gradient of the cost function via backpropagation (implemented in the backprop
function). Once we exhaust the mini-batch, we adjust the weights and biases
according to the estimated gradients. The update includes len(mini_batch) in
the denominator because we want the average gradient over all the estimates in
the mini-batch. We can also control how fast the weights and biases get updated
by adjusting the learning rate, eta, which globally modulates how big the
updates from each mini-batch can be.
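The arithmetic of that update is easier to see with plain numbers. The sketch below is an illustration only: it assumes the parameters are a flat list of floats and that the per-example gradients are already known, and the function name averaged_step is hypothetical rather than taken from the book’s code.

```python
def averaged_step(params, per_example_grads, eta):
    """Apply one mini-batch update: params - (eta / n) * sum of gradients."""
    n = len(per_example_grads)
    # Accumulate the gradient estimates component by component,
    # like nabla_b and nabla_w in update_mini_batch.
    summed = [sum(grad[i] for grad in per_example_grads)
              for i in range(len(params))]
    # Dividing eta by n turns the summed gradient into an average.
    return [p - (eta / n) * s for p, s in zip(params, summed)]


# Two parameters, a mini-batch of two examples, learning rate 0.5.
new_params = averaged_step([1.0, 2.0], [[0.5, 1.0], [1.5, 3.0]], eta=0.5)
print(new_params)
```

With these inputs the summed gradient is [2.0, 4.0] and eta / n is 0.25, so each parameter moves by a quarter of its summed gradient.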
The backprop function calculates the cost gradient for the neural network by
starting with the expected output of the network given the input image and then
working backward through the network to propagate the error through the layers.
This requires a substantial amount of data munging, and it’s where I spent most
of my time porting the code to Rust, but I think it’s a little too long to dive
into in depth here. Take a look at chapter 2 of the book if you want more
detail.
The Rust Implementation
The first step here was to figure out how to load the data. That ended up being
fiddly enough that I decided to break that off into its own post. With that
sorted, I then had to figure out how to represent the Python Network class in
Rust. I ended up deciding to use a struct:
use ndarray::Array2;

#[derive(Debug)]
struct Network {
    num_layers: usize,
    sizes: Vec<usize>,
    biases: Vec<Array2<f64>>,
    weights: Vec<Array2<f64>>,
}
The struct gets initialized with the number of neurons in each layer in much the
same way as the Python implementation:
use rand::distributions::StandardNormal;
use ndarray::{Array, Array2};
use ndarray_rand::RandomExt;

impl Network {
    fn new(sizes: &[usize]) -> Network {
        let num_layers = sizes.len();
        let mut biases: Vec<Array2<f64>> = Vec::new();
        let mut weights: Vec<Array2<f64>> = Vec::new();
        for i in 1..num_layers {
            biases.push(Array::random((sizes[i], 1), StandardNormal));
            weights.push(Array::random((sizes[i], sizes[i - 1]), StandardNormal));
        }
        Network {
            num_layers,
            sizes: sizes.to_owned(),
            biases,
            weights,
        }
    }
}
One difference is that in Python we used numpy.random.randn to initialize the
biases and weights, while in Rust we use the Array::random function, which the
ndarray_rand::RandomExt trait adds to ndarray arrays and which accepts a
rand::distributions::Distribution as a parameter, allowing the choice of an
arbitrary distribution. In this case we used the
rand::distributions::StandardNormal distribution. It’s worth noting that this
uses an interface defined in three different crates. Two of them, ndarray itself
and ndarray-rand, are maintained by the ndarray authors, and the third, rand, is
maintained by a different set of developers.
The merits of monolithic packages
In principle it’s nice that random number generation is not isolated inside the
ndarray codebase: if new random number distributions or capabilities are added
to rand, then ndarray and all other crates in the Rust ecosystem that need
random numbers can benefit equally. On the other hand, it adds some cognitive
overhead to have to jump between the documentation for the various crates
instead of having a single centralized place to look. In my particular case I
also got a little unlucky and happened to do this project right after rand made
a release that changed its public API. This led to an incompatibility between
ndarray-rand, which depended on version 0.6 of rand, and my project, which
declared a dependency on version 0.7.
I’d heard that cargo and Rust’s build system handle this sort of problem really
well, but at least in this case I was presented with a confusing error message
about how the random number distribution I was passing in didn’t satisfy the
Distribution trait. While this is true (it satisfied the Distribution trait from
rand 0.7 but not the one from rand 0.6 that ndarray-rand expected), it is
extremely confusing because the version numbers of the various crates don’t show
up in the error message. I ended up reporting this as an issue. I discovered
there that confusing error messages from crates with incompatible APIs are a
long-standing issue for the Rust language. Hopefully in the future Rust can grow
more helpful error messages.
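The failure mode can be mimicked in a few lines of Python. This is only an analogy, not a description of how cargo or the Rust compiler work internally: the factory function and the class names below are hypothetical stand-ins for two copies of the same trait coming from two versions of one crate. Both classes report the same name, yet an object built against one is not an instance of the other, which is why a message that omits the version is so hard to read.

```python
def make_distribution_class(version):
    """Build a fresh class named Distribution, tagged with a crate version."""
    class Distribution:
        crate_version = version
    return Distribution


# Two independent copies, standing in for rand 0.6 and rand 0.7.
Distribution06 = make_distribution_class("0.6")
Distribution07 = make_distribution_class("0.7")


class StandardNormal(Distribution07):
    """A distribution that only implements the 0.7 flavour."""


sample = StandardNormal()
same_name = Distribution06.__name__ == Distribution07.__name__

print("same name:", same_name)
print("satisfies 0.7:", isinstance(sample, Distribution07))
print("satisfies 0.6:", isinstance(sample, Distribution06))
```

An error that prints only the name Distribution cannot tell these two apart; printing crate_version alongside it would.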
In the end this separation of concerns caused a lot of friction for me as a new
user. In Python I could have simply done import numpy and been done. I do think
that NumPy probably went a bit too far in the direction of being completely
monolithic (it was originally written at a time when packaging and distributing
Python code with C extensions was much harder than it is today), but I also
think that going too far toward the other extreme can make a language or
ecosystem of tools harder to learn.
Types and ownership
The next bit I’ll show in detail is the Rust version of update_mini_batch:
impl Network {
    fn update_mini_batch(
        &mut self,
        training_data: &[MnistImage],
        mini_batch_indices: &[usize],
        eta: f64,
    ) {
        let mut nabla_b: Vec<Array2<f64>> = zero_vec_like(&self.biases);
        let mut nabla_w: Vec<Array2<f64>> = zero_vec_like(&self.weights);
        for i in mini_batch_indices {
            let (delta_nabla_b, delta_nabla_w) = self.backprop(&training_data[*i]);
            for (nb, dnb) in nabla_b.iter_mut().zip(delta_nabla_b.iter()) {
                *nb += dnb;
            }
            for (nw, dnw) in nabla_w.iter_mut().zip(delta_nabla_w.iter()) {
                *nw += dnw;
            }
        }
        let nbatch = mini_batch_indices.len() as f64;
        for (w, nw) in self.weights.iter_mut().zip(nabla_w.iter()) {
            *w -= &nw.mapv(|x| x * eta / nbatch);
        }
        for (b, nb) in self.biases.iter_mut().zip(nabla_b.iter()) {
            *b -= &nb.mapv(|x| x * eta / nbatch);
        }
    }
}
The function makes use of two short helper functions I defined that make this a
little less verbose:
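// Convert a two-element shape slice into the tuple that Array2::zeros expects.
fn to_tuple(inp: &[usize]) -> (usize, usize) {
    match inp {
        [a, b] => (*a, *b),
        _ => panic!("expected a two-dimensional shape"),
    }
}

// Build a Vec of zero-filled arrays with the same shapes as the input arrays.
fn zero_vec_like(inp: &[Array2<f64>]) -> Vec<Array2<f64>> {
    inp.iter()
        .map(|x| Array2::zeros(to_tuple(x.shape())))
        .collect()
}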
Compared with the Python implementation, the interface for calling
update_mini_batch is a little different. Rather than passing in a list of
objects directly, I pass in a reference to the full set of training data and a
slice of indices to consider within that full set. This ended up being a little
easier to reason about without triggering the borrow checker.
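The index-based interface is easy to sketch in Python. The snippet below is an illustration of the idea, not the code from this project: it shuffles a list of indices into the training set and cuts it into mini-batches, so the training data itself never has to be moved or copied. The function name mini_batch_indices and the explicit rng argument are assumptions made for the example.

```python
import random


def mini_batch_indices(n_examples, batch_size, rng):
    """Shuffle the indices 0..n_examples-1 and split them into mini-batches."""
    indices = list(range(n_examples))
    rng.shuffle(indices)
    # Step through the shuffled list in strides of batch_size;
    # the last batch may be shorter than the others.
    return [indices[k:k + batch_size]
            for k in range(0, n_examples, batch_size)]


# Ten training examples, batches of four, seeded for repeatability.
batches = mini_batch_indices(10, 4, random.Random(0))
for batch in batches:
    print(batch)
```

Each inner list plays the role of the mini_batch_indices slice: the caller keeps one shared reference to the full training set and hands out only indices.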
Creating nabla_b and nabla_w in zero_vec_like is very similar to the list
comprehension we used in Python. There is one wrinkle that caused me some
frustration: if I try to create a zero-filled array with Array::zeros and pass
it a slice or Vec for the shape, I get back an ArrayD instance. To get an Array2
(that is, explicitly a 2D array and not a generic D-dimensional array) I need to
pass a tuple to Array::zeros. However, since the shape method on an ndarray
array returns a slice, I need to convert the slice to a tuple manually using the
to_tuple function. This sort of thing can be glossed over in Python, but in Rust
the difference between a tuple and a slice can be very important, as it is in
this API.
The code to estimate the updates for the weights and biases via backpropagation
has a very similar structure to the Python implementation. We run
backpropagation on each example image in the mini-batch and obtain estimates for
the gradient of the quadratic cost as a function of the biases and weights:
let (delta_nabla_b, delta_nabla_w) = self.backprop(&training_data[*i]);
and then accumulate these estimates:
for (nb, dnb) in nabla_b.iter_mut().zip(delta_nabla_b.iter()) {
    *nb += dnb;
}
for (nw, dnw) in nabla_w.iter_mut().zip(delta_nabla_w.iter()) {
    *nw += dnw;
}
Once we’ve finished processing the mini-batch, we update the weights and biases,
modulated by the learning rate:
let nbatch = mini_batch_indices.len() as f64;
for (w, nw) in self.weights.iter_mut().zip(nabla_w.iter()) {
    *w -= &nw.mapv(|x| x * eta / nbatch);
}
for (b, nb) in self.biases.iter_mut().zip(nabla_b.iter()) {
    *b -= &nb.mapv(|x| x * eta / nbatch);
}
This example illustrates how the ergonomics of working with array data are very
different in Rust compared with Python. First, rather than multiplying the array
by the float eta / nbatch, we instead use Array::mapv and define a closure
in-line that is applied to every element of the array. This sort of thing would
not be very fast in Python because function calls are very slow. In Rust it
doesn’t make much difference. We also need to borrow the return value of mapv
with & when we subtract, lest we consume the array data while we iterate over
it. Needing to think carefully about whether functions consume data or take
references makes it much more conceptually demanding to write code like this in
Rust than in Python. On the other hand, I do have much higher confidence that my
code is correct when it compiles. I’m not sure whether this code was so
demanding for me to write because Rust really is harder to write or because of
the disparity between my experience in Rust and Python.
Rewrite it in Rust and everything will be better
At this point I was left with something that was faster than the unoptimized
Python version I had started with. However, instead of the 10x or better speedup
that one might expect moving from a dynamic, interpreted language like Python to
a compiled, performance-oriented language like Rust, I only observed about a 2x
improvement. To understand why, I decided to profile the Rust code. Luckily
there is a very nice project that makes it easy to generate flame graphs for
Rust projects: flamegraph. This adds a flamegraph subcommand to cargo, so one
only needs to run cargo flamegraph in a crate. It runs the code and then writes
a flamegraph svg file that one can inspect with a web browser.
[Figure: interactive flame graph of the nndl_rust binary produced by cargo flamegraph. Among the frame labels that survive in the extracted text, ndarray’s element-wise add_assign (via zip_mut_with) is the widest at about 23.4% of samples (1,213 samples); Network::evaluate accounts for about 3.8% and cblas_dgemm for about 3.5%.]
nndl_rust::Network::backprop::_$u7b$$u7b$closure$u7d$$u7d$::hf6708644d898f30b (12 samples, 0.23%) ndarray::impl_constructors::<impl ndarray::ArrayBase<S, D>>::zeros (12 samples, 0.23%) ndarray::impl_constructors::<impl ndarray::ArrayBase<S, D>>::from_elem (12 samples, 0.23%) ndarray::dimension::size_of_shape_checked (1 samples, 0.02%) core::slice::<impl [T]>::iter (1 samples, 0.02%) core::ptr::<impl *const T>::add (1 samples, 0.02%) core::ptr::<impl *const T>::offset (1 samples, 0.02%) core::iter::traits::iterator::Iterator::for_each::_$u7b$$u7b$closure$u7d$$u7d$::h196eb9eb7e8d762e (1 samples, 0.02%) _$LT$alloc..vec..Vec$LT$T$GT$$u20$as$u20$alloc..vec..SpecExtend$LT$T$C$$u20$I$GT$$GT$::spec_extend::_$u7b$$u7b$closure$u7d$$u7d$::h9e5ebf4ec37dad8e (1 samples, 0.02%) core::ptr::write (1 samples, 0.02%) core::iter::traits::iterator::Iterator::collect (308 samples, 5.94%) core::i.. <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter (308 samples, 5.94%) <alloc:.. <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T, I>>::from_iter (308 samples, 5.94%) <alloc:.. <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T, I>>::spec_extend (308 samples, 5.94%) <alloc:.. core::iter::traits::iterator::Iterator::for_each (307 samples, 5.92%) core::i.. <core::iter::adapters::Map<I, F> as core::iter::traits::iterator::Iterator>::fold (307 samples, 5.92%) <core::.. <core::slice::Iter<'a, T> as core::iter::traits::iterator::Iterator>::fold (306 samples, 5.90%) <core::.. _$LT$core..iter..adapters..Map$LT$I$C$$u20$F$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$::fold::_$u7b$$u7b$closure$u7d$$u7d$::hcf8d84009fd76c57 (293 samples, 5.65%) _$LT$co.. nndl_rust::Network::backprop::_$u7b$$u7b$closure$u7d$$u7d$::hc025177ae5e59770 (292 samples, 5.63%) nndl_ru.. ndarray::impl_constructors::<impl ndarray::ArrayBase<S, D>>::zeros (292 samples, 5.63%) ndarray.. ndarray::impl_constructors::<impl ndarray::ArrayBase<S, D>>::from_elem (292 samples, 5.63%) ndarray.. 
alloc::vec::from_elem (289 samples, 5.58%) alloc::.. <T as alloc::vec::SpecFromElem>::from_elem (289 samples, 5.58%) <T as a.. <alloc::raw_vec::RawVec<T>>::with_capacity_zeroed (289 samples, 5.58%) <alloc:.. <alloc::raw_vec::RawVec<T, A>>::allocate_in (289 samples, 5.58%) <alloc:.. <alloc::alloc::Global as core::alloc::Alloc>::alloc_zeroed (288 samples, 5.56%) <alloc:.. alloc::alloc::alloc_zeroed (288 samples, 5.56%) alloc::.. __libc_calloc (288 samples, 5.56%) __libc_.. [libc-2.27.so] (286 samples, 5.52%) [libc-2.. [libc-2.27.so] (2 samples, 0.04%) __rust_dealloc (1 samples, 0.02%) <alloc::vec::Vec<T> as core::ops::drop::Drop>::drop (5 samples, 0.10%) core::ptr::drop_in_place (5 samples, 0.10%) core::ptr::real_drop_in_place (5 samples, 0.10%) core::ptr::real_drop_in_place (5 samples, 0.10%) core::ptr::real_drop_in_place (5 samples, 0.10%) core::ptr::real_drop_in_place (5 samples, 0.10%) core::ptr::real_drop_in_place (5 samples, 0.10%) <alloc::raw_vec::RawVec<T, A> as core::ops::drop::Drop>::drop (5 samples, 0.10%) <alloc::raw_vec::RawVec<T, A>>::dealloc_buffer (5 samples, 0.10%) <alloc::alloc::Global as core::alloc::Alloc>::dealloc (5 samples, 0.10%) alloc::alloc::dealloc (5 samples, 0.10%) cfree (4 samples, 0.08%) __rust_dealloc (1 samples, 0.02%) <alloc::raw_vec::RawVec<T, A> as core::ops::drop::Drop>::drop (2 samples, 0.04%) <alloc::raw_vec::RawVec<T, A>>::dealloc_buffer (2 samples, 0.04%) <alloc::alloc::Global as core::alloc::Alloc>::dealloc (2 samples, 0.04%) alloc::alloc::dealloc (2 samples, 0.04%) cfree (1 samples, 0.02%) core::ptr::real_drop_in_place (14 samples, 0.27%) core::ptr::real_drop_in_place (9 samples, 0.17%) core::ptr::real_drop_in_place (7 samples, 0.14%) core::ptr::real_drop_in_place (7 samples, 0.14%) <alloc::raw_vec::RawVec<T, A> as core::ops::drop::Drop>::drop (7 samples, 0.14%) <alloc::raw_vec::RawVec<T, A>>::dealloc_buffer (7 samples, 0.14%) <alloc::alloc::Global as core::alloc::Alloc>::dealloc (7 samples, 0.14%) alloc::alloc::dealloc (7 
samples, 0.14%) cfree (7 samples, 0.14%) <alloc::vec::Vec<T>>::extend_from_slice (46 samples, 0.89%) <alloc::vec::Vec<T> as alloc::vec::SpecExtend<&'a T, core::slice::Iter<'a, T>>>::spec_extend (46 samples, 0.89%) core::slice::<impl [T]>::copy_from_slice (46 samples, 0.89%) core::intrinsics::copy_nonoverlapping (46 samples, 0.89%) [libc-2.27.so] (46 samples, 0.89%) <alloc::alloc::Global as core::alloc::Alloc>::alloc (12 samples, 0.23%) alloc::alloc::alloc (12 samples, 0.23%) __libc_malloc (12 samples, 0.23%) [libc-2.27.so] (3 samples, 0.06%) ndarray::impl_clone::<impl core::clone::Clone for ndarray::ArrayBase<S, D>>::clone (63 samples, 1.22%) <ndarray::OwnedRepr<A> as ndarray::data_traits::DataClone>::clone_with_ptr (60 samples, 1.16%) <ndarray::OwnedRepr<A> as core::clone::Clone>::clone (60 samples, 1.16%) <alloc::vec::Vec<T> as core::clone::Clone>::clone (60 samples, 1.16%) alloc::slice::<impl [T]>::to_vec (59 samples, 1.14%) alloc::slice::hack::to_vec (59 samples, 1.14%) <alloc::vec::Vec<T>>::with_capacity (13 samples, 0.25%) <alloc::raw_vec::RawVec<T>>::with_capacity (13 samples, 0.25%) <alloc::raw_vec::RawVec<T, A>>::allocate_in (13 samples, 0.25%) core::num::<impl usize>::checked_mul (1 samples, 0.02%) core::num::<impl usize>::overflowing_mul (1 samples, 0.02%) core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (3 samples, 0.06%) <usize as core::iter::range::Step>::add_usize (3 samples, 0.06%) core::num::<impl usize>::checked_add (3 samples, 0.06%) core::num::<impl usize>::overflowing_add (3 samples, 0.06%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::as_slice_mut (1 samples, 0.02%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::is_standard_layout (1 samples, 0.02%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::is_standard_layout::is_standard_layout (1 samples, 0.02%) core::iter::traits::iterator::Iterator::any (1 samples, 0.02%) 
core::iter::traits::iterator::Iterator::try_for_each (1 samples, 0.02%) <core::slice::Iter<'a, T> as core::iter::traits::iterator::Iterator>::try_fold (1 samples, 0.02%) ndarray::impl_ops::arithmetic_ops::<impl core::ops::arith::Add<&'a ndarray::ArrayBase<S2, E>> for ndarray::ArrayBase<S, D>>::add (5 samples, 0.10%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::zip_mut_with (5 samples, 0.10%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::zip_mut_with_same_shape (5 samples, 0.10%) ndarray::impl_ops::arithmetic_ops::_$LT$impl$u20$core..ops..arith..Add$LT$$RF$$u27$a$u20$ndarray..ArrayBase$LT$S2$C$$u20$E$GT$$GT$$u20$for$u20$ndarray..ArrayBase$LT$S$C$$u20$D$GT$$GT$::add::_$u7b$$u7b$closure$u7d$$u7d$::hdb62ebe110047a82 (1 samples, 0.02%) <f64 as core::ops::arith::Add>::add (1 samples, 0.02%) core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (1 samples, 0.02%) <usize as core::iter::range::Step>::add_usize (1 samples, 0.02%) core::num::<impl usize>::checked_add (1 samples, 0.02%) core::num::<impl usize>::overflowing_add (1 samples, 0.02%) ndarray::impl_ops::arithmetic_ops::<impl core::ops::arith::Mul<ndarray::ArrayBase<S2, E>> for ndarray::ArrayBase<S, D>>::mul (3 samples, 0.06%) ndarray::impl_ops::arithmetic_ops::<impl core::ops::arith::Mul<&'a ndarray::ArrayBase<S2, E>> for ndarray::ArrayBase<S, D>>::mul (3 samples, 0.06%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::zip_mut_with (2 samples, 0.04%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::zip_mut_with_same_shape (2 samples, 0.04%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::as_slice_mut (1 samples, 0.02%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::is_standard_layout (1 samples, 0.02%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::is_standard_layout::is_standard_layout (1 samples, 0.02%) core::iter::traits::iterator::Iterator::any (1 samples, 0.02%) 
core::iter::traits::iterator::Iterator::try_for_each (1 samples, 0.02%) <core::slice::Iter<'a, T> as core::iter::traits::iterator::Iterator>::try_fold (1 samples, 0.02%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::dim (1 samples, 0.02%) <ndarray::dimension::dim::Dim<I> as core::clone::Clone>::clone (1 samples, 0.02%) core::clone::Clone::clone (1 samples, 0.02%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::view_mut (1 samples, 0.02%) ndarray::impl_views::<impl ndarray::ArrayBase<ndarray::ViewRepr<&'a mut A>, D>>::new_ (1 samples, 0.02%) __GI___pthread_mutex_unlock (1 samples, 0.02%) blas_memory_alloc (4 samples, 0.08%) __GI___pthread_mutex_lock (1 samples, 0.02%) blas_memory_free (2 samples, 0.04%) __GI___pthread_mutex_lock (1 samples, 0.02%) dgemm_itcopy_HASWELL (12 samples, 0.23%) ndarray::linalg::impl_linalg::_$LT$impl$u20$ndarray..ArrayBase$LT$S$C$$u20$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$GT$$GT$::dot::h6d56249ebe4e2f12 (25 samples, 0.48%) _$LT$ndarray..ArrayBase$LT$S$C$$u20$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$GT$$u20$as$u20$ndarray..linalg..impl_linalg..Dot$LT$ndarray..ArrayBase$LT$S2$C$$u20$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$GT$$GT$$GT$::dot::hc6e58547238892a7 (25 samples, 0.48%) ndarray::linalg::impl_linalg::mat_mul_impl (23 samples, 0.44%) cblas_dgemm (23 samples, 0.44%) dgemm_nn (15 samples, 0.29%) dgemm_oncopy_HASWELL (1 samples, 0.02%) <alloc::alloc::Global as core::alloc::Alloc>::alloc (2 samples, 0.04%) alloc::alloc::alloc (2 samples, 0.04%) __libc_malloc (2 samples, 0.04%) <alloc::vec::Vec<T>>::with_capacity (4 samples, 0.08%) <alloc::raw_vec::RawVec<T>>::with_capacity (4 samples, 0.08%) <alloc::raw_vec::RawVec<T, A>>::allocate_in (4 samples, 0.08%) core::num::<impl usize>::checked_mul (2 samples, 0.04%) core::num::<impl usize>::overflowing_mul (2 samples, 0.04%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::view (1 samples, 0.02%) 
__GI___pthread_mutex_unlock (1 samples, 0.02%) __GI___pthread_mutex_lock (4 samples, 0.08%) blas_memory_alloc (7 samples, 0.14%) pthread_mutex_lock@plt (1 samples, 0.02%) blas_memory_free (3 samples, 0.06%) __GI___pthread_mutex_lock (2 samples, 0.04%) dgemm_itcopy_HASWELL (33 samples, 0.64%) ndarray::linalg::impl_linalg::_$LT$impl$u20$ndarray..ArrayBase$LT$S$C$$u20$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$GT$$GT$::dot::hcef54f2ce75d3905 (1,109 samples, 21.40%) ndarray::linalg::impl_linalg::_$L.. _$LT$ndarray..ArrayBase$LT$S$C$$u20$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$GT$$u20$as$u20$ndarray..linalg..impl_linalg..Dot$LT$ndarray..ArrayBase$LT$S2$C$$u20$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$GT$$GT$$GT$::dot::h48a7db9fda0dd69f (1,109 samples, 21.40%) _$LT$ndarray..ArrayBase$LT$S$C$$u.. ndarray::linalg::impl_linalg::mat_mul_impl (1,104 samples, 21.30%) ndarray::linalg::impl_linalg::mat.. cblas_dgemm (1,104 samples, 21.30%) cblas_dgemm dgemm_nn (1,090 samples, 21.03%) dgemm_nn dgemm_oncopy_HASWELL (1,042 samples, 20.10%) dgemm_oncopy_HASWELL [[kernel.kallsyms]] (1 samples, 0.02%) [[kernel.kallsyms]] (1 samples, 0.02%) <alloc::alloc::Global as core::alloc::Alloc>::alloc (10 samples, 0.19%) alloc::alloc::alloc (10 samples, 0.19%) __libc_malloc (10 samples, 0.19%) [libc-2.27.so] (7 samples, 0.14%) <alloc::vec::Vec<T>>::with_capacity (12 samples, 0.23%) <alloc::raw_vec::RawVec<T>>::with_capacity (11 samples, 0.21%) <alloc::raw_vec::RawVec<T, A>>::allocate_in (11 samples, 0.21%) core::num::<impl usize>::checked_mul (1 samples, 0.02%) core::num::<impl usize>::overflowing_mul (1 samples, 0.02%) __GI___pthread_mutex_unlock (1 samples, 0.02%) blas_memory_alloc (5 samples, 0.10%) blas_memory_free (5 samples, 0.10%) __GI___pthread_mutex_lock (3 samples, 0.06%) dgemm_itcopy_HASWELL (2 samples, 0.04%) 
ndarray::linalg::impl_linalg::_$LT$impl$u20$ndarray..ArrayBase$LT$S$C$$u20$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$GT$$GT$::dot::hd1efb90db040728a (70 samples, 1.35%) _$LT$ndarray..ArrayBase$LT$S$C$$u20$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$GT$$u20$as$u20$ndarray..linalg..impl_linalg..Dot$LT$ndarray..ArrayBase$LT$S2$C$$u20$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$GT$$GT$$GT$::dot::h9a17b0f2ea36613c (70 samples, 1.35%) ndarray::linalg::impl_linalg::mat_mul_impl (57 samples, 1.10%) cblas_dgemm (57 samples, 1.10%) dgemm_nn (44 samples, 0.85%) dgemm_oncopy_HASWELL (29 samples, 0.56%) nndl_rust::Network::cost_derivative (2 samples, 0.04%) ndarray::impl_ops::arithmetic_ops::<impl core::ops::arith::Sub<&'a ndarray::ArrayBase<S2, E>> for &'a ndarray::ArrayBase<S, D>>::sub (2 samples, 0.04%) ndarray::impl_ops::arithmetic_ops::<impl core::ops::arith::Sub<&'a ndarray::ArrayBase<S2, E>> for ndarray::ArrayBase<S, D>>::sub (2 samples, 0.04%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::zip_mut_with (1 samples, 0.02%) core::cmp::impls::<impl core::cmp::PartialEq<&'b B> for &'a A>::eq (1 samples, 0.02%) core::slice::<impl core::cmp::PartialEq<[B]> for [A]>::eq (1 samples, 0.02%) <[A] as core::slice::SlicePartialEq<A>>::equal (1 samples, 0.02%) core::ptr::real_drop_in_place (3 samples, 0.06%) core::ptr::real_drop_in_place (3 samples, 0.06%) core::ptr::real_drop_in_place (3 samples, 0.06%) core::ptr::real_drop_in_place (3 samples, 0.06%) <alloc::raw_vec::RawVec<T, A> as core::ops::drop::Drop>::drop (3 samples, 0.06%) <alloc::raw_vec::RawVec<T, A>>::dealloc_buffer (3 samples, 0.06%) <alloc::alloc::Global as core::alloc::Alloc>::dealloc (3 samples, 0.06%) alloc::alloc::dealloc (3 samples, 0.06%) cfree (3 samples, 0.06%) ndarray::impl_constructors::<impl ndarray::ArrayBase<S, D>>::from_shape_vec_unchecked (1 samples, 0.02%) ndarray::impl_constructors::<impl ndarray::ArrayBase<S, 
D>>::from_vec_dim_stride_unchecked (1 samples, 0.02%) @plt (3 samples, 0.06%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::mapv (36 samples, 0.69%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::map (36 samples, 0.69%) ndarray::iterators::to_vec_mapped (35 samples, 0.68%) <core::slice::Iter<'a, T> as core::iter::traits::iterator::Iterator>::fold (35 samples, 0.68%) ndarray::iterators::to_vec_mapped::_$u7b$$u7b$closure$u7d$$u7d$::h96d4b65474d2726b (34 samples, 0.66%) ndarray::impl_methods::_$LT$impl$u20$ndarray..ArrayBase$LT$S$C$$u20$D$GT$$GT$::mapv::_$u7b$$u7b$closure$u7d$$u7d$::h8da8f33936077f61 (34 samples, 0.66%) core::ops::function::FnMut::call_mut (34 samples, 0.66%) std::f64::<impl f64>::exp (34 samples, 0.66%) expf64 (32 samples, 0.62%) [libm-2.27.so] (27 samples, 0.52%) ndarray::impl_ops::arithmetic_ops::<impl core::ops::arith::Add<ndarray::ArrayBase<S, D>> for f64>::add (1 samples, 0.02%) ndarray::impl_ops::arithmetic_ops::<impl core::ops::arith::Add<B> for ndarray::ArrayBase<S, D>>::add (1 samples, 0.02%) <ndarray::ArrayBase<S, D>>::unordered_foreach_mut (1 samples, 0.02%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::as_slice_memory_order_mut (1 samples, 0.02%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::is_contiguous (1 samples, 0.02%) ndarray::dimension::dimension_trait::Dimension::is_contiguous (1 samples, 0.02%) _$LT$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$u20$as$u20$ndarray..dimension..dimension_trait..Dimension$GT$::equal::h151a9fa6ecaa2e70 (1 samples, 0.02%) core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (1 samples, 0.02%) <usize as core::iter::range::Step>::add_usize (1 samples, 0.02%) core::num::<impl usize>::checked_add (1 samples, 0.02%) core::num::<impl usize>::overflowing_add (1 samples, 0.02%) ndarray::impl_ops::arithmetic_ops::<impl core::ops::arith::Div<ndarray::ArrayBase<S, D>> for f64>::div (3 samples, 0.06%) 
<ndarray::ArrayBase<S, D>>::unordered_foreach_mut (2 samples, 0.04%) ndarray::impl_ops::arithmetic_ops::_$LT$impl$u20$core..ops..arith..Div$LT$ndarray..ArrayBase$LT$S$C$$u20$D$GT$$GT$$u20$for$u20$f64$GT$::div::_$u7b$$u7b$closure$u7d$$u7d$::h941efb770ba33155 (1 samples, 0.02%) ndarray::impl_constructors::<impl ndarray::ArrayBase<S, D>>::from_shape_vec_unchecked (2 samples, 0.04%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::as_slice_memory_order (1 samples, 0.02%) nndl_rust::sigmoid (49 samples, 0.95%) ndarray::impl_ops::arithmetic_ops::<impl core::ops::arith::Neg for &'a ndarray::ArrayBase<S, D>>::neg (6 samples, 0.12%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::map (6 samples, 0.12%) ndarray::iterators::to_vec_mapped (3 samples, 0.06%) <alloc::vec::Vec<T>>::with_capacity (3 samples, 0.06%) <alloc::raw_vec::RawVec<T>>::with_capacity (3 samples, 0.06%) <alloc::raw_vec::RawVec<T, A>>::allocate_in (3 samples, 0.06%) <alloc::alloc::Global as core::alloc::Alloc>::alloc (3 samples, 0.06%) alloc::alloc::alloc (3 samples, 0.06%) __libc_malloc (3 samples, 0.06%) core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (1 samples, 0.02%) <usize as core::iter::range::Step>::add_usize (1 samples, 0.02%) core::num::<impl usize>::checked_add (1 samples, 0.02%) core::num::<impl usize>::overflowing_add (1 samples, 0.02%) ndarray::impl_ops::arithmetic_ops::<impl core::ops::arith::Mul<ndarray::ArrayBase<S2, E>> for ndarray::ArrayBase<S, D>>::mul (2 samples, 0.04%) ndarray::impl_ops::arithmetic_ops::<impl core::ops::arith::Mul<&'a ndarray::ArrayBase<S2, E>> for ndarray::ArrayBase<S, D>>::mul (2 samples, 0.04%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::zip_mut_with (2 samples, 0.04%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::zip_mut_with_same_shape (2 samples, 0.04%) 
ndarray::impl_ops::arithmetic_ops::_$LT$impl$u20$core..ops..arith..Mul$LT$$RF$$u27$a$u20$ndarray..ArrayBase$LT$S2$C$$u20$E$GT$$GT$$u20$for$u20$ndarray..ArrayBase$LT$S$C$$u20$D$GT$$GT$::mul::_$u7b$$u7b$closure$u7d$$u7d$::h4a0e51cd65dfa508 (1 samples, 0.02%) ndarray::impl_ops::arithmetic_ops::<impl core::ops::arith::Sub<ndarray::ArrayBase<S, D>> for f64>::sub (1 samples, 0.02%) <ndarray::ArrayBase<S, D>>::unordered_foreach_mut (1 samples, 0.02%) core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (1 samples, 0.02%) <usize as core::iter::range::Step>::add_usize (1 samples, 0.02%) core::num::<impl usize>::checked_add (1 samples, 0.02%) core::num::<impl usize>::overflowing_add (1 samples, 0.02%) core::ptr::real_drop_in_place (1 samples, 0.02%) core::ptr::real_drop_in_place (1 samples, 0.02%) core::ptr::real_drop_in_place (1 samples, 0.02%) core::ptr::real_drop_in_place (1 samples, 0.02%) <alloc::raw_vec::RawVec<T, A> as core::ops::drop::Drop>::drop (1 samples, 0.02%) <alloc::raw_vec::RawVec<T, A>>::dealloc_buffer (1 samples, 0.02%) <alloc::alloc::Global as core::alloc::Alloc>::dealloc (1 samples, 0.02%) alloc::alloc::dealloc (1 samples, 0.02%) cfree (1 samples, 0.02%) ndarray::impl_constructors::<impl ndarray::ArrayBase<S, D>>::from_shape_vec_unchecked (2 samples, 0.04%) ndarray::impl_constructors::<impl ndarray::ArrayBase<S, D>>::from_vec_dim_stride_unchecked (1 samples, 0.02%) _$LT$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$u20$as$u20$ndarray..dimension..dimension_trait..Dimension$GT$::default_strides::h0e83b923d9e9a1fd (1 samples, 0.02%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::as_slice_memory_order (2 samples, 0.04%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::is_contiguous (2 samples, 0.04%) ndarray::dimension::dimension_trait::Dimension::is_contiguous (2 samples, 0.04%) 
_$LT$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$u20$as$u20$ndarray..dimension..dimension_trait..Dimension$GT$::equal::h151a9fa6ecaa2e70 (1 samples, 0.02%) <alloc::vec::Vec<T>>::with_capacity (1 samples, 0.02%) <alloc::raw_vec::RawVec<T>>::with_capacity (1 samples, 0.02%) <alloc::raw_vec::RawVec<T, A>>::allocate_in (1 samples, 0.02%) <core::slice::Iter<'a, T> as core::iter::traits::iterator::Iterator>::next (1 samples, 0.02%) <core::slice::Iter<'a, T>>::post_inc_start (1 samples, 0.02%) core::ptr::<impl *const T>::offset (1 samples, 0.02%) core::ptr::write (1 samples, 0.02%) @plt (2 samples, 0.04%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::mapv (64 samples, 1.23%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::map (64 samples, 1.23%) ndarray::iterators::to_vec_mapped (60 samples, 1.16%) <core::slice::Iter<'a, T> as core::iter::traits::iterator::Iterator>::fold (58 samples, 1.12%) ndarray::iterators::to_vec_mapped::_$u7b$$u7b$closure$u7d$$u7d$::h96d4b65474d2726b (57 samples, 1.10%) ndarray::impl_methods::_$LT$impl$u20$ndarray..ArrayBase$LT$S$C$$u20$D$GT$$GT$::mapv::_$u7b$$u7b$closure$u7d$$u7d$::h8da8f33936077f61 (56 samples, 1.08%) core::ops::function::FnMut::call_mut (56 samples, 1.08%) std::f64::<impl f64>::exp (56 samples, 1.08%) expf64 (56 samples, 1.08%) [libm-2.27.so] (48 samples, 0.93%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::as_slice_memory_order_mut (1 samples, 0.02%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::is_contiguous (1 samples, 0.02%) ndarray::dimension::dimension_trait::Dimension::is_contiguous (1 samples, 0.02%) _$LT$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$u20$as$u20$ndarray..dimension..dimension_trait..Dimension$GT$::equal::h151a9fa6ecaa2e70 (1 samples, 0.02%) ndarray::impl_ops::arithmetic_ops::<impl core::ops::arith::Add<ndarray::ArrayBase<S, D>> for f64>::add (3 samples, 0.06%) ndarray::impl_ops::arithmetic_ops::<impl core::ops::arith::Add<B> for 
ndarray::ArrayBase<S, D>>::add (3 samples, 0.06%) <ndarray::ArrayBase<S, D>>::unordered_foreach_mut (2 samples, 0.04%) ndarray::impl_ops::arithmetic_ops::_$LT$impl$u20$core..ops..arith..Add$LT$B$GT$$u20$for$u20$ndarray..ArrayBase$LT$S$C$$u20$D$GT$$GT$::add::_$u7b$$u7b$closure$u7d$$u7d$::h336cfd83104cd862 (1 samples, 0.02%) _$LT$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$u20$as$u20$ndarray..dimension..dimension_trait..Dimension$GT$::default_strides::h0e83b923d9e9a1fd (1 samples, 0.02%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::as_slice_memory_order_mut (2 samples, 0.04%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::is_contiguous (2 samples, 0.04%) ndarray::dimension::dimension_trait::Dimension::is_contiguous (2 samples, 0.04%) _$LT$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$u20$as$u20$ndarray..dimension..dimension_trait..Dimension$GT$::equal::h151a9fa6ecaa2e70 (1 samples, 0.02%) ndarray::impl_ops::arithmetic_ops::<impl core::ops::arith::Div<ndarray::ArrayBase<S, D>> for f64>::div (5 samples, 0.10%) <ndarray::ArrayBase<S, D>>::unordered_foreach_mut (5 samples, 0.10%) ndarray::impl_ops::arithmetic_ops::_$LT$impl$u20$core..ops..arith..Div$LT$ndarray..ArrayBase$LT$S$C$$u20$D$GT$$GT$$u20$for$u20$f64$GT$::div::_$u7b$$u7b$closure$u7d$$u7d$::h941efb770ba33155 (3 samples, 0.06%) ndarray::impl_constructors::<impl ndarray::ArrayBase<S, D>>::from_shape_vec_unchecked (3 samples, 0.06%) nndl_rust::Network::backprop (1,753 samples, 33.82%) nndl_rust::Network::backprop nndl_rust::sigmoid_prime (81 samples, 1.56%) nndl_rust::sigmoid (78 samples, 1.50%) ndarray::impl_ops::arithmetic_ops::<impl core::ops::arith::Neg for &'a ndarray::ArrayBase<S, D>>::neg (5 samples, 0.10%) ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::map (5 samples, 0.10%) ndarray::iterators::to_vec_mapped (2 samples, 0.04%) <alloc::vec::Vec<T>>::with_capacity (2 samples, 0.04%) <alloc::raw_vec::RawVec<T>>::with_capacity (2 samples, 0.04%) 
<alloc::raw_vec::RawVec<T, A>>::allocate_in (2 samples, 0.04%) <alloc::alloc::Global as core::alloc::Alloc>::alloc (2 samples, 0.04%) alloc::alloc::alloc (2 samples, 0.04%) __libc_malloc (2 samples, 0.04%) nndl_rust::Network::sgd (3,265 samples, 62.99%) nndl_rust::Network::sgd nndl_rust::Network::update_mini_batch (3,065 samples, 59.14%) nndl_rust::Network::update_mini_batch nndl_rust::zero_vec_like (12 samples, 0.23%) core::iter::traits::iterator::Iterator::collect (12 samples, 0.23%) <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter (12 samples, 0.23%) <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T, I>>::from_iter (12 samples, 0.23%) <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T, I>>::spec_extend (12 samples, 0.23%) core::iter::traits::iterator::Iterator::for_each (12 samples, 0.23%) <core::iter::adapters::Map<I, F> as core::iter::traits::iterator::Iterator>::fold (12 samples, 0.23%) <core::slice::Iter<'a, T> as core::iter::traits::iterator::Iterator>::fold (12 samples, 0.23%) _$LT$core..iter..adapters..Map$LT$I$C$$u20$F$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$::fold::_$u7b$$u7b$closure$u7d$$u7d$::hcf8d84009fd76c57 (12 samples, 0.23%) nndl_rust::Network::backprop::_$u7b$$u7b$closure$u7d$$u7d$::hc025177ae5e59770 (12 samples, 0.23%) ndarray::impl_constructors::<impl ndarray::ArrayBase<S, D>>::zeros (12 samples, 0.23%) ndarray::impl_constructors::<impl ndarray::ArrayBase<S, D>>::from_elem (12 samples, 0.23%) alloc::vec::from_elem (12 samples, 0.23%) <T as alloc::vec::SpecFromElem>::from_elem (12 samples, 0.23%) <alloc::raw_vec::RawVec<T>>::with_capacity_zeroed (12 samples, 0.23%) <alloc::raw_vec::RawVec<T, A>>::allocate_in (12 samples, 0.23%) <alloc::alloc::Global as core::alloc::Alloc>::alloc_zeroed (12 samples, 0.23%) alloc::alloc::alloc_zeroed (12 samples, 0.23%) __libc_calloc (12 samples, 0.23%) [libc-2.27.so] (12 samples, 0.23%) <alloc::vec::Vec<T>>::push (1 samples, 0.02%) <alloc::vec::Vec<T>>::reserve (1 
[Figure: flamegraph of the nndl-rust binary, profiled with perf (5,183 samples in total). The widest frames are nndl_rust::main (64.19%), dgemm_kernel_HASWELL (22.57%) and dgemm_beta_HASWELL (10.46%); nndl_rust::mnist_reader::load_data accounts for 1.20%.]
If you’ve never looked at a flamegraph before, the idea is that the width of the
bar for a routine is proportional to the share of the program’s runtime spent in
that routine. The main function is at the bottom of the graph and
functions called by main are stacked on top. This gives you a simple view into
what functions take up the most time in a program - anything that is very “wide”
in the graph is where most of the time is spent and any stack of functions that
is very tall and wide is spending a lot of time in code very deep in a call
stack. Looking at the flamegraph above we can see that about half of the time is
spent in functions with names like dgemm_kernel_HASWELL; these are
functions in the OpenBLAS linear algebra library. The rest of the time is spent
doing addition between arrays in update_mini_batch and allocating arrays;
all other parts of my program make a negligible contribution to the runtime.
If we made an analogous flamegraph for the Python code, we would see a
similar pattern: most time is spent doing linear algebra (in the places
where np.dot is called inside the backpropagation routine). So since most of
the time in either Rust or Python is spent inside a numerical linear algebra
library, we can never hope for a 10x speedup.
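The reasoning behind that limit is Amdahl's law. As a rough sketch, assume a fraction blas_fraction of the runtime is spent inside the linear algebra library (the flamegraph suggests about half) and that only the remaining code gets faster. The function name and the numbers below are illustrative assumptions, not measurements from my program.

```rust
/// Upper bound on the overall speedup when only the non-BLAS part of a
/// program is made `factor` times faster (Amdahl's law).
/// `blas_fraction` is the share of runtime spent in BLAS, between 0 and 1.
fn max_speedup(blas_fraction: f64, factor: f64) -> f64 {
    1.0 / (blas_fraction + (1.0 - blas_fraction) / factor)
}

fn main() {
    // Assumption: roughly half of the runtime is inside OpenBLAS.
    let blas_fraction = 0.5;
    for factor in [2.0, 10.0, 1000.0] {
        println!(
            "non-BLAS code {}x faster: at most {:.2}x overall",
            factor,
            max_speedup(blas_fraction, factor)
        );
    }
}
```

With half of the time fixed inside BLAS, the bound approaches 2x no matter how fast the rest of the code becomes.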
In fact it’s worse than that. One of the exercises in the book is to rewrite the
Python code to use vectorized matrix multiplication. In this approach the
backpropagation for all of the samples in each mini-batch happens in a single
set of vectorized matrix multiplication operations. This requires the ability to
perform matrix multiplication between 3D and 2D arrays. Since each matrix multiplication
operation happens using a larger amount of data than the non-vectorized case,
OpenBLAS is able to more efficiently occupy CPU caches and registers, ultimately
better using the available CPU resources on my laptop. The rewritten Python
version ends up faster than the Rust version, again by a factor of two or so.
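To make the 3D-by-2D product concrete: a mini-batch can be thought of as a stack of input vectors, and the batched product applies the same weight matrix to every one of them. The plain-Rust sketch below only illustrates what such a product computes. It uses nested Vecs rather than ndarray, does not call BLAS, and the function name batched_matvec is made up for this example.

```rust
/// Apply one weight matrix to every input vector in a mini-batch.
/// `weights` is a list of rows; `batch` holds one input vector per sample.
/// Assumes every row of `weights` has the same length as each input.
fn batched_matvec(weights: &[Vec<f64>], batch: &[Vec<f64>]) -> Vec<Vec<f64>> {
    batch
        .iter()
        .map(|x| {
            weights
                .iter()
                // Dot product of one weight row with the input vector.
                .map(|row| row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f64>())
                .collect::<Vec<f64>>()
        })
        .collect()
}

fn main() {
    let weights = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    // Three samples, each a 2-element input vector.
    let batch = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![1.0, 1.0]];
    for out in batched_matvec(&weights, &batch) {
        println!("{:?}", out);
    }
}
```

A BLAS-backed version would hand the whole stack to the library in one call instead of looping over samples, which is where the speedup described above comes from.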
In principle it’s possible to apply the same optimization to the Rust code.
However, the ndarray crate does not yet support matrix
multiplication for dimensionalities higher than 2. It might also be possible to use
thread parallelization on the mini-batch updates using a library like
rayon. I tried this on my laptop and did
not see any speedups, but I might have on a beefier machine with more CPU
threads. I could also have tried using a different low-level linear algebra
implementation; for example, there are Rust bindings for
tensorflow and torch. However, at that point I feel
like I might as well be using the Python bindings for those libraries.
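For readers curious what parallelizing a mini-batch update might look like, here is a minimal sketch. It is not the code I benchmarked: it uses scoped threads from the standard library instead of rayon so that it has no dependencies, and sample_gradient is a hypothetical stand-in for the per-sample backpropagation step.

```rust
use std::thread;

/// Hypothetical stand-in for per-sample backpropagation: returns a
/// "gradient" for one sample. Here it is just the sample scaled by 2.
fn sample_gradient(sample: &[f64]) -> Vec<f64> {
    sample.iter().map(|v| 2.0 * v).collect()
}

/// Sum the per-sample gradients of a mini-batch, splitting the batch into
/// at most `n_threads` chunks that are processed on separate threads.
/// Assumes a non-empty batch and `n_threads >= 1`.
fn batch_gradient(batch: &[Vec<f64>], n_threads: usize) -> Vec<f64> {
    let dim = batch[0].len();
    let chunk_size = (batch.len() + n_threads - 1) / n_threads;
    let partials: Vec<Vec<f64>> = thread::scope(|s| {
        // Spawn every worker first, then join them all.
        let handles: Vec<_> = batch
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    let mut acc = vec![0.0; dim];
                    for sample in chunk {
                        for (a, g) in acc.iter_mut().zip(sample_gradient(sample)) {
                            *a += g;
                        }
                    }
                    acc
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    // Reduce the per-thread partial sums on the calling thread.
    let mut total = vec![0.0; dim];
    for partial in partials {
        for (t, v) in total.iter_mut().zip(partial) {
            *t += v;
        }
    }
    total
}

fn main() {
    let batch = vec![
        vec![1.0, 2.0],
        vec![3.0, 4.0],
        vec![5.0, 6.0],
        vec![7.0, 8.0],
    ];
    println!("{:?}", batch_gradient(&batch, 2));
}
```

Each thread accumulates its own partial sum, so no locking is needed until the final reduction. For small mini-batches the cost of spawning threads can outweigh the work being split up.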
Is Rust suitable for data science workflows?
Right now I have to say that the answer is “not yet”. I’ll definitely reach for
Rust in the future when I need to write optimized low-level code with minimal
dependencies. However, using it as a full replacement for Python or C++ will
require a more stable and well-developed ecosystem of packages.