Efficient Sum and Mean
In many situations we are not really that interested in the individual loss values (or derivatives) of each observation, but in their sum or mean, be it weighted or unweighted. For example, by computing the unweighted mean of the loss for our training set, we would effectively compute what is known as the empirical risk. This is usually the quantity (or an important part of it) that we are interested in minimizing.
When we say “weighted” or “unweighted”, we are referring to
whether we are explicitly specifying the influence of individual
observations on the result. “Weighing” an observation is achieved
by multiplying its value with some number (i.e. the “weight” of
that observation). As a consequence that weighted observation
will have a stronger or weaker influence on the result.
In order to weigh an observation we have to know which array
dimension (if there is more than one) denotes the observations.
On the other hand, for computing an unweighted result we don’t
actually need to know anything about the meaning of the array
dimensions, as long as the targets and the outputs are of
compatible shape and size.
The naive way to compute such an unweighted reduction would be
to call mean or sum on the result of the element-wise
operation. The following code snippet shows an example of that. We
say “naive” because it will not give us acceptable
performance.
julia> value(L1DistLoss(), [1.,2,3], [2,5,-2])
3-element Array{Float64,1}:
1.0
3.0
5.0
# WARNING: Bad code
julia> sum(value(L1DistLoss(), [1.,2,3], [2,5,-2]))
9.0
This works as expected, but there is a price for it. Before the
sum can be computed, value will allocate a temporary array
and fill it with the element-wise results. After that, sum
will iterate over this temporary array and accumulate the values
accordingly. Bottom line: we allocate temporary memory that we
don’t need in the end and could avoid.
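To illustrate the difference, the following sketch contrasts the two approaches in plain Julia. It is not the implementation used by the package: l1, naive_sum, and accumulated_sum are hypothetical helper names that we introduce only for this illustration.

```julia
# Hypothetical stand-in for an element-wise L1 distance loss (not the package's code).
l1(target, output) = abs(output - target)

# Naive version: broadcasting materializes a temporary array before summing.
naive_sum(targets, outputs) = sum(l1.(targets, outputs))

# Accumulating version: each element-wise result is added to a running total,
# so no intermediate array is created.
function accumulated_sum(targets, outputs)
    length(targets) == length(outputs) || throw(DimensionMismatch("lengths differ"))
    total = zero(l1(first(targets), first(outputs)))
    for (t, o) in zip(targets, outputs)
        total += l1(t, o)
    end
    return total
end

targets = [1.0, 2.0, 3.0]
outputs = [2.0, 5.0, -2.0]

naive_sum(targets, outputs)        # 9.0
accumulated_sum(targets, outputs)  # 9.0
```

Both functions return the same number. The second one simply never stores the three intermediate values 1.0, 3.0, and 5.0 in an array.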
For that reason we provide special methods that compute the
common accumulations efficiently without allocating temporary
arrays. These methods can be invoked using an additional
parameter which specifies how the values should be accumulated /
averaged. The type of this parameter has to be a subtype of
AverageMode.
Average Modes
Before we discuss these memory-efficient methods, let us briefly
introduce the available average mode types. We provide a number
of different average modes, all of which are contained within
the namespace AvgMode. An instance of such a type can then be
used as an additional parameter to value(), deriv(), and
deriv2(), as we will see further down.
The following list shows the available average modes, each with a short description of its effect when used as an additional parameter to the functions mentioned above.
- class AvgMode.None
  Used by default. This will cause the element-wise results to be returned.
- class AvgMode.Sum
  Causes the method to return the unweighted sum of the elements instead of the individual elements. Can be used in combination with ObsDim, in which case a vector will be returned containing the sum for each observation (useful mainly for multi-variable regression).
- class AvgMode.Mean
  Causes the method to return the unweighted mean of the elements instead of the individual elements. Can be used in combination with ObsDim, in which case a vector will be returned containing the mean for each observation (useful mainly for multi-variable regression).
- class AvgMode.WeightedSum
  Causes the method to return the weighted sum of all observations. The variable weights has to be a vector of the same length as the number of observations. If normalize = true, the values of the weight vector will be normalized in such a way that they sum to one.
  - weights
    Vector of weight values that can be used to give certain observations a stronger influence on the sum.
    julia> AvgMode.WeightedSum([1,1,2]); # 3 observations
  - normalize
    Boolean that specifies if the weight vector should be transformed in such a way that it sums to one (i.e. normalized). This will not mutate the weight vector but instead happen on the fly during the accumulation.
    Defaults to false. Setting it to true only really makes sense in multi-variable regression, otherwise the result will be the same as for WeightedMean.
    julia> AvgMode.WeightedSum([1,1,2], normalize = true);
- class AvgMode.WeightedMean
  Causes the method to return the weighted mean of all observations. The variable weights has to be a vector of the same length as the number of observations. If normalize = true, the values of the weight vector will be normalized in such a way that they sum to one.
  - weights
    Vector of weight values that can be used to give certain observations a stronger influence on the mean.
    julia> AvgMode.WeightedMean([1,1,2]); # 3 observations
  - normalize
    Boolean that specifies if the weight vector should be transformed in such a way that it sums to one (i.e. normalized). This will not mutate the weight vector but instead happen on the fly during the accumulation.
    Defaults to true. Setting it to false only really makes sense in multi-variable regression, otherwise the result will be the same as for WeightedSum.
    julia> AvgMode.WeightedMean([1,1,2], normalize = false);
Unweighted Sum and Mean
As hinted before, we provide special memory-efficient methods for
computing the sum or the mean of the element-wise (or
broadcasted) results of value(), deriv(), and
deriv2(). These methods avoid the allocation of a temporary
array and instead compute the result directly.
- value(loss, targets, outputs, avgmode) → Number
  Computes the unweighted sum or mean (depending on avgmode) of the individual values of the loss function for each pair in targets and outputs. This method will not allocate a temporary array.
  In the case that the two parameters are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.
  Note: This function should always be type-stable. If it isn’t, you likely found a bug.
  Parameters:
  - loss (SupervisedLoss) – The loss-function we are interested in.
  - targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
  - outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
  - avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
  Returns: The unweighted sum or mean of the individual values of the loss function for all values in targets and outputs.
  Return type: Number
julia> value(L1DistLoss(), [1,2,3], [2,5,-2], AvgMode.Sum())
9
julia> value(L1DistLoss(), [1.,2,3], [2,5,-2], AvgMode.Sum())
9.0
julia> value(L1DistLoss(), [1,2,3], [2,5,-2], AvgMode.Mean())
3.0
julia> value(L1DistLoss(), Float32[1,2,3], Float32[2,5,-2], AvgMode.Mean())
3.0f0
The exact same method signature is also implemented for
deriv() and deriv2() respectively.
- deriv(loss, targets, outputs, avgmode) → Number
  Computes the unweighted sum or mean (depending on avgmode) of the individual derivatives of the loss function for each pair in targets and outputs. This method will not allocate a temporary array.
  In the case that the two parameters are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.
  Note: This function should always be type-stable. If it isn’t, you likely found a bug.
  Parameters:
  - loss (SupervisedLoss) – The loss-function we are interested in.
  - targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
  - outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
  - avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
  Returns: The unweighted sum or mean of the individual derivatives of the loss function for all values in targets and outputs.
  Return type: Number
julia> deriv(L2DistLoss(), [1,2,3], [2,5,-2], AvgMode.Sum())
-2
julia> deriv(L2DistLoss(), [1,2,3], [2,5,-2], AvgMode.Mean())
-0.6666666666666665
- deriv2(loss, targets, outputs, avgmode) → Number
  Computes the unweighted sum or mean (depending on avgmode) of the individual 2nd derivatives of the loss function for each pair in targets and outputs. This method will not allocate a temporary array.
  In the case that the two parameters are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.
  Note: This function should always be type-stable. If it isn’t, you likely found a bug.
  Parameters:
  - loss (SupervisedLoss) – The loss-function we are interested in.
  - targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
  - outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
  - avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
  Returns: The unweighted sum or mean of the individual 2nd derivatives of the loss function for all values in targets and outputs.
  Return type: Number
julia> deriv2(LogitDistLoss(), [1.,2,3], [2,5,-2], AvgMode.Sum())
0.49687329928636825
julia> deriv2(LogitDistLoss(), [1.,2,3], [2,5,-2], AvgMode.Mean())
0.1656244330954561
Sum and Mean per Observation
When the targets and predicted outputs are multi-dimensional arrays instead of vectors, we may be interested in accumulating the values over all but one dimension. This is typically the case when we work in a multi-variable regression setting, where each observation has multiple outputs and thus multiple targets. In those scenarios we may be more interested in the average loss for each observation, rather than the total average over all the data.
To be able to accumulate the values for each observation
separately, we have to know and explicitly specify the dimension
that denotes the observations. For that purpose we provide the
types contained in the namespace ObsDim.
- value(loss, targets, outputs, avgmode, obsdim) → Vector
  Computes the values of the loss function for each pair in targets and outputs individually, and returns either the unweighted sum or mean for each observation (depending on avgmode). This method will not allocate a temporary array, but it will allocate the resulting vector.
  Both arrays have to be of the same shape and size. Furthermore they have to have at least two array dimensions (i.e. they must not be vectors).
  Note: This function should always be type-stable. If it isn’t, you likely found a bug.
  Parameters:
  - loss (SupervisedLoss) – The loss-function we are interested in.
  - targets (AbstractArray) – The multi-dimensional array of ground truths \(\mathbf{y}\).
  - outputs (AbstractArray) – The multi-dimensional array of predicted outputs \(\mathbf{\hat{y}}\).
  - avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
  - obsdim (ObsDimension) – Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
  Returns: A vector that contains the unweighted sums / means of the loss for each observation in targets and outputs.
  Return type: Vector
Consider the following two matrices, targets and outputs.
There are two ways to interpret the shape of these arrays if one
dimension is to denote the observations.
julia> targets = rand(2,4)
2×4 Array{Float64,2}:
0.0743675 0.285303 0.247157 0.223666
0.513145 0.59224 0.32325 0.989964
julia> outputs = rand(2,4)
2×4 Array{Float64,2}:
0.6335 0.319131 0.637087 0.613777
0.513495 0.264587 0.533555 0.714688
The first interpretation would be to say that the first dimension denotes the observations. Thus this data would consist of two observations with four variables each.
julia> value(L1DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.First())
2-element Array{Float64,1}:
1.373
0.813583
julia> value(L1DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.First())
2-element Array{Float64,1}:
0.34325
0.203396
The second possible interpretation would be to say that the second/last dimension denotes the observations. In that case our data consists of four observations with two variables each.
julia> value(L1DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.Last())
4-element Array{Float64,1}:
0.559482
0.36148
0.600235
0.665386
julia> value(L1DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.Last())
4-element Array{Float64,1}:
0.279741
0.18074
0.300118
0.332693
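The effect of the observation dimension can also be sketched in plain Julia. The code below is only an illustration under our own assumptions, not the package's implementation: l1 and sum_per_obs are hypothetical names, and the observation dimension is passed as a plain integer (1 or 2) instead of an ObsDim type.

```julia
# Hypothetical stand-in for an element-wise L1 distance loss.
l1(target, output) = abs(output - target)

# Sum of the losses per observation, where `obsdim` (1 or 2) is the matrix
# dimension that indexes the observations. Only the result vector is allocated.
function sum_per_obs(targets::AbstractMatrix, outputs::AbstractMatrix, obsdim::Integer)
    size(targets) == size(outputs) || throw(DimensionMismatch("sizes differ"))
    obsdim in (1, 2) || throw(ArgumentError("obsdim must be 1 or 2"))
    T = typeof(l1(first(targets), first(outputs)))
    result = zeros(T, size(targets, obsdim))
    for j in axes(targets, 2), i in axes(targets, 1)
        k = obsdim == 1 ? i : j   # observation that element (i, j) belongs to
        result[k] += l1(targets[i, j], outputs[i, j])
    end
    return result
end

targets = [1.0 2.0 3.0; 4.0 5.0 6.0]   # 2×3 matrix
outputs = [2.0 2.0 1.0; 4.0 7.0 3.0]

sum_per_obs(targets, outputs, 1)   # [3.0, 5.0]       two observations, three variables each
sum_per_obs(targets, outputs, 2)   # [1.0, 2.0, 5.0]  three observations, two variables each
```

Dividing every entry of the result by the number of variables per observation, size(targets, 3 - obsdim), would give the per-observation mean instead of the sum.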
Because this method returns a vector of values, we also provide a mutating version that can make use of a preallocated vector to write the results into.
- value!(buffer, loss, targets, outputs, avgmode, obsdim) → Vector
  Computes the values of the loss function for each pair in targets and outputs individually, and returns either the unweighted sum or mean for each observation, depending on avgmode. The results are stored into the given vector buffer. This method will not allocate a temporary array.
  Both arrays have to be of the same shape and size. Furthermore they have to have at least two array dimensions (i.e. they must not be vectors).
  Note: This function should always be type-stable. If it isn’t, you likely found a bug.
  Parameters:
  - buffer (AbstractVector) – Array to store the computed values in. Old values will be overwritten and lost.
  - loss (SupervisedLoss) – The loss-function we are interested in.
  - targets (AbstractArray) – The multi-dimensional array of ground truths \(\mathbf{y}\).
  - outputs (AbstractArray) – The multi-dimensional array of predicted outputs \(\mathbf{\hat{y}}\).
  - avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
  - obsdim (ObsDimension) – Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
  Returns: buffer (for convenience).
julia> buffer = zeros(2);
julia> value!(buffer, L1DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.First())
2-element Array{Float64,1}:
1.373
0.813583
julia> value!(buffer, L1DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.First())
2-element Array{Float64,1}:
0.34325
0.203396
julia> buffer = zeros(4);
julia> value!(buffer, L1DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.Last())
4-element Array{Float64,1}:
0.559482
0.36148
0.600235
0.665386
julia> value!(buffer, L1DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.Last())
4-element Array{Float64,1}:
0.279741
0.18074
0.300118
0.332693
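The idea behind such a mutating method can be sketched as follows. This is a minimal illustration in plain Julia, assuming that the last dimension of a matrix denotes the observations. The names l1 and sum_per_obs_last! are hypothetical and not part of the package.

```julia
# Hypothetical stand-in for an element-wise L1 distance loss.
l1(target, output) = abs(output - target)

# In-place variant: overwrites `buffer` with the per-observation sums and
# returns it. The second (last) dimension indexes the observations.
function sum_per_obs_last!(buffer::AbstractVector, targets::AbstractMatrix, outputs::AbstractMatrix)
    size(targets) == size(outputs) || throw(DimensionMismatch("sizes differ"))
    length(buffer) == size(targets, 2) || throw(DimensionMismatch("buffer has wrong length"))
    fill!(buffer, zero(eltype(buffer)))   # old values are discarded
    for j in axes(targets, 2), i in axes(targets, 1)
        buffer[j] += l1(targets[i, j], outputs[i, j])
    end
    return buffer
end

targets = [1.0 2.0 3.0; 4.0 5.0 6.0]
outputs = [2.0 2.0 1.0; 4.0 7.0 3.0]
buffer = fill(99.0, 3)                        # stale contents on purpose

sum_per_obs_last!(buffer, targets, outputs)   # buffer is now [1.0, 2.0, 5.0]
```

Because the buffer is reset before accumulating, it can be reused across many calls (for example once per training iteration) without allocating a new result vector each time.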
We also provide both of these methods for deriv() and
deriv2() respectively.
- deriv(loss, targets, outputs, avgmode, obsdim) → Vector
  Same as below, but using the 1st derivative.
- deriv2(loss, targets, outputs, avgmode, obsdim) → Vector
  Computes the (2nd) derivatives of the loss function for each pair in targets and outputs individually and returns either the unweighted sum or mean for each observation (depending on avgmode). This method will not allocate a temporary array, but it will allocate the resulting vector.
  Both arrays have to be of the same shape and size. Furthermore they have to have at least two array dimensions (i.e. they must not be vectors).
  Note: This function should always be type-stable. If it isn’t, you likely found a bug.
  Parameters:
  - loss (SupervisedLoss) – The loss-function we are interested in.
  - targets (AbstractArray) – The multi-dimensional array of ground truths \(\mathbf{y}\).
  - outputs (AbstractArray) – The multi-dimensional array of predicted outputs \(\mathbf{\hat{y}}\).
  - avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
  - obsdim (ObsDimension) – Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
  Returns: A vector that contains the unweighted sums / means of the (2nd) loss-derivatives for each observation in targets and outputs.
  Return type: Vector
julia> targets = rand(2,4)
2×4 Array{Float64,2}:
0.0743675 0.285303 0.247157 0.223666
0.513145 0.59224 0.32325 0.989964
julia> outputs = rand(2,4)
2×4 Array{Float64,2}:
0.6335 0.319131 0.637087 0.613777
0.513495 0.264587 0.533555 0.714688
julia> deriv(L2DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.First())
2-element Array{Float64,1}:
2.746
-0.784548
julia> deriv(L2DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.First())
2-element Array{Float64,1}:
0.686501
-0.196137
julia> deriv(L2DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.Last())
4-element Array{Float64,1}:
1.11896
-0.58765
1.20047
0.22967
julia> deriv(L2DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.Last())
4-element Array{Float64,1}:
0.559482
-0.293825
0.600235
0.114835
Because this method returns a vector of values, we also provide a mutating version that can make use of a preallocated vector to write the results into.
- deriv!(buffer, loss, targets, outputs, avgmode, obsdim) → Vector
  Same as below, but using the 1st derivative.
- deriv2!(buffer, loss, targets, outputs, avgmode, obsdim) → Vector
  Computes the (2nd) derivatives of the loss function for each pair in targets and outputs individually, and returns either the unweighted sums or means for each observation, depending on avgmode. The results are stored into the given vector buffer. This method will not allocate a temporary array.
  Both arrays have to be of the same shape and size. Furthermore they have to have at least two array dimensions (i.e. they must not be vectors).
  Note: This function should always be type-stable. If it isn’t, you likely found a bug.
  Parameters:
  - buffer (AbstractVector) – Array to store the computed values in. Old values will be overwritten and lost.
  - loss (SupervisedLoss) – The loss-function we are interested in.
  - targets (AbstractArray) – The multi-dimensional array of ground truths \(\mathbf{y}\).
  - outputs (AbstractArray) – The multi-dimensional array of predicted outputs \(\mathbf{\hat{y}}\).
  - avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
  - obsdim (ObsDimension) – Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
  Returns: buffer (for convenience).
julia> buffer = zeros(2);
julia> deriv!(buffer, L2DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.First())
2-element Array{Float64,1}:
2.746
-0.784548
julia> deriv!(buffer, L2DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.First())
2-element Array{Float64,1}:
0.686501
-0.196137
julia> buffer = zeros(4);
julia> deriv!(buffer, L2DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.Last())
4-element Array{Float64,1}:
1.11896
-0.58765
1.20047
0.22967
julia> deriv!(buffer, L2DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.Last())
4-element Array{Float64,1}:
0.559482
-0.293825
0.600235
0.114835
Weighted Sum and Mean
Up to this point, all the averaging was performed in an unweighted manner. That means that each observation was treated as equal and thus had the same potential influence on the result. In this sub-section we will consider the situations in which we do want to explicitly specify the influence of each observation (i.e. we want to weigh them). When we say we “weigh” an observation, what it effectively boils down to is multiplying the result for that observation (i.e. the computed loss or derivative) with some number. This is done for every observation individually.
To get a better understanding of what we are talking about, let us
consider performing a weighting scheme manually. The following
code will compute the loss for three observations, and then
multiply the result of the second observation with the number
2, while the other two remain as they are. If we then sum up
the results, we will see that the loss of the second observation
was effectively counted twice.
julia> result = value.(L1DistLoss(), [1.,2,3], [2,5,-2]) .* [1,2,1]
3-element Array{Float64,1}:
1.0
6.0
5.0
julia> sum(result)
12.0
The point of weighing observations is to inform the learning algorithm we are working with that it is more important to us to predict some observations correctly than others. So really, the concrete weight-factor matters less than the ratio between the different weights. In the example above the second observation was thus considered twice as important as any of the other two observations.
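The role of the ratio can be made concrete with a small sketch in plain Julia. It is only an illustration under our own assumptions, not the package's implementation: l1 and weighted_sum are hypothetical names, and the normalize keyword merely mimics the option described for the weighted average modes by dividing each weight by the total weight during accumulation.

```julia
# Hypothetical stand-in for an element-wise L1 distance loss.
l1(target, output) = abs(output - target)

# Weighted sum accumulated on the fly. With `normalize = true` every weight is
# divided by the total weight while accumulating; `weights` is never mutated.
function weighted_sum(targets, outputs, weights; normalize::Bool = false)
    length(targets) == length(outputs) == length(weights) ||
        throw(DimensionMismatch("lengths differ"))
    scale = normalize ? inv(sum(weights)) : 1.0
    total = 0.0
    for (t, o, w) in zip(targets, outputs, weights)
        total += (w * scale) * l1(t, o)
    end
    return total
end

targets = [1.0, 2.0, 3.0]
outputs = [2.0, 5.0, -2.0]

weighted_sum(targets, outputs, [1, 2, 1])                    # 12.0
weighted_sum(targets, outputs, [1, 2, 1]; normalize = true)  # 3.0
weighted_sum(targets, outputs, [2, 4, 2]; normalize = true)  # 3.0, same ratio of weights
```

Once the weights are normalized, [1, 2, 1] and [2, 4, 2] describe the same weighting scheme, which is why the last two calls agree.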
In the case of multi-dimensional arrays the process isn’t that simple anymore. In such a scenario, computing the weighted sum (or weighted mean) can be thought of as having an additional step. First we either compute the sum or (unweighted) average for each observation (which results in a vector), and then we compute the weighted sum of all observations.
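One way to write this two-step process down is sketched below. The notation is ours and only meant for illustration: we assume \(n\) observations with \(m\) variables each, write \(\ell_{ij}\) for the loss of variable \(j\) of observation \(i\), and \(w_i\) for the weight of observation \(i\). We further assume the default normalize settings of the two weighted average modes.

\[
\text{weighted sum} = \sum_{i=1}^{n} w_i \sum_{j=1}^{m} \ell_{ij}
\qquad
\text{weighted mean} = \sum_{i=1}^{n} \frac{w_i}{\sum_{k=1}^{n} w_k} \left( \frac{1}{m} \sum_{j=1}^{m} \ell_{ij} \right)
\]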
The following code snippet demonstrates how to compute the
AvgMode.WeightedSum([2,1]) manually. This is not meant as
an example of how to do it, but simply to show what is happening
qualitatively. In this example we assume that we are working in a
multi-variable regression setting, in which our data set has two
observations with four target-variables each (i.e. the first
dimension denotes the observations).
julia> targets = rand(2,4)
2×4 Array{Float64,2}:
0.0743675 0.285303 0.247157 0.223666
0.513145 0.59224 0.32325 0.989964
julia> outputs = rand(2,4)
2×4 Array{Float64,2}:
0.6335 0.319131 0.637087 0.613777
0.513495 0.264587 0.533555 0.714688
# WARNING: BAD CODE - ONLY FOR ILLUSTRATION
julia> tmp = sum(value.(L1DistLoss(), targets, outputs),2) # assuming ObsDim.First()
2×1 Array{Float64,2}:
1.373
0.813584
julia> sum(tmp .* [2, 1]) # weigh 1st observation twice as high
3.559587
To manually compute the result for
AvgMode.WeightedMean([2,1]) we follow a similar approach, but
use the normalized weight vector in the last step.
# WARNING: BAD CODE - ONLY FOR ILLUSTRATION
julia> tmp = mean(value.(L1DistLoss(), targets, outputs),2) # ObsDim.First()
2×1 Array{Float64,2}:
0.34325
0.203396
julia> sum(tmp .* [0.6666, 0.3333]) # weigh 1st observation twice as high
0.29660258677499995
Note that you can specify explicitly if you want to normalize the
weight vector. That option is supported for computing the
weighted sum, as well as for computing the weighted mean. See the
documentation for AvgMode.WeightedSum and
AvgMode.WeightedMean for more information.
The code-snippets above are of course very inefficient, because
they allocate (multiple) temporary arrays. We only included them
to demonstrate what is happening in terms of desired result /
effect. For doing those computations efficiently we provide
special methods for value(), deriv(), deriv2()
and their mutating counterparts.
- value(loss, targets, outputs, wavgmode[, obsdim]) → Number
  Computes the values of the loss function for each pair in targets and outputs individually and returns either the weighted sum or the weighted mean over all observations (depending on wavgmode). This method will not allocate a temporary array. Both arrays have to be of the same shape and size.
  Note: This function should always be type-stable. If it isn’t, you likely found a bug.
  Parameters:
  - loss (SupervisedLoss) – The loss-function we are interested in.
  - targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
  - outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
  - wavgmode (AverageMode) – Must either be of type AvgMode.WeightedSum or AvgMode.WeightedMean. Either way, the specified weight vector must have one element for each observation in targets and outputs.
  - obsdim (ObsDimension) – Optional. Defaults to ObsDim.Last(). Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
  Returns: The weighted sum or mean of the loss over all observations in targets and outputs.
  Return type: Number
julia> value(L1DistLoss(), [1.,2,3], [2,5,-2], AvgMode.WeightedSum([1,2,1]))
12.0
julia> value(L1DistLoss(), [1.,2,3], [2,5,-2], AvgMode.WeightedMean([1,2,1]))
3.0
julia> targets = rand(2,4)
2×4 Array{Float64,2}:
0.0743675 0.285303 0.247157 0.223666
0.513145 0.59224 0.32325 0.989964
julia> outputs = rand(2,4)
2×4 Array{Float64,2}:
0.6335 0.319131 0.637087 0.613777
0.513495 0.264587 0.533555 0.714688
julia> value(L1DistLoss(), targets, outputs, AvgMode.WeightedSum([2,1]), ObsDim.First())
3.5595869999999996
julia> value(L1DistLoss(), targets, outputs, AvgMode.WeightedMean([2,1]), ObsDim.First())
0.29663224999999993
We also provide both of these methods for deriv() and
deriv2() respectively.
- deriv(loss, targets, outputs, wavgmode[, obsdim]) → Number
  Same as below, but using the 1st derivative.
- deriv2(loss, targets, outputs, wavgmode[, obsdim]) → Number
  Computes the (2nd) derivatives of the loss function for each pair in targets and outputs individually and returns either the weighted sum or the weighted mean over all observations (depending on wavgmode). This method will not allocate a temporary array. Both arrays have to be of the same shape and size.
  Note: This function should always be type-stable. If it isn’t, you likely found a bug.
  Parameters:
  - loss (SupervisedLoss) – The loss-function we are interested in.
  - targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
  - outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
  - wavgmode (AverageMode) – Must either be of type AvgMode.WeightedSum or AvgMode.WeightedMean. Either way, the specified weight vector must have one element for each observation in targets and outputs.
  - obsdim (ObsDimension) – Optional. Defaults to ObsDim.Last(). Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
  Returns: The weighted sum or mean of the (2nd) loss-derivatives over all observations in targets and outputs.
  Return type: Number
julia> deriv(L2DistLoss(), [1.,2,3], [2,5,-2], AvgMode.WeightedSum([1,2,1]))
4.0
julia> deriv(L2DistLoss(), [1.,2,3], [2,5,-2], AvgMode.WeightedMean([1,2,1]))
1.0
julia> targets = rand(2,4)
2×4 Array{Float64,2}:
0.0743675 0.285303 0.247157 0.223666
0.513145 0.59224 0.32325 0.989964
julia> outputs = rand(2,4)
2×4 Array{Float64,2}:
0.6335 0.319131 0.637087 0.613777
0.513495 0.264587 0.533555 0.714688
julia> deriv(L2DistLoss(), targets, outputs, AvgMode.WeightedSum([2,1]), ObsDim.First())
4.707458000000001
julia> value(L2DistLoss(), targets, outputs, AvgMode.WeightedMean([2,1]), ObsDim.First())
0.12194772056937497