LossFunctions.jl’s documentation
This package represents a community effort to centralize the definition and implementation of loss functions in Julia. As such, it is a part of the JuliaML ecosystem.
The sole purpose of this package is to provide an efficient and extensible implementation of various loss functions used throughout Machine Learning (ML). It is thus intended to serve as a special purpose back-end for other ML libraries that require losses to accomplish their tasks. To that end we provide a considerable number of carefully implemented loss functions, as well as an API to query their properties (e.g. convexity). Furthermore, we expose methods to compute their values, derivatives, and second derivatives for single observations as well as arbitrarily sized arrays of observations. In the case of arrays, a user can additionally specify if and how element-wise results are averaged or summed.
From an end-user’s perspective one normally does not need to import this package directly. That said, it should provide a decent starting point for any student that is interested in investigating the properties or behaviour of loss functions.
Where to begin?
If this is the first time you consider using LossFunctions for your machine learning related experiments or packages, make sure to check out the “Getting Started” section.
Getting Started
LossFunctions.jl is the result of a collaborative effort to design and implement an efficient but also convenient-to-use Julia library for, well, loss functions. As such, this package implements the functionality needed to query various properties about a loss function (such as convexity), as well as a number of methods to compute its value, derivative, and second derivative for single observations or arrays of observations.
In this section we will provide a condensed overview of the package. In order to keep this overview concise, we will not discuss any background information or theory on the losses here in detail.
Installation
To install LossFunctions.jl, start up Julia and type the following code snippet into the REPL. It makes use of the native Julia package manager.
Pkg.add("LossFunctions")
Additionally, if you encounter any sudden issues or would like to contribute to the package, you can manually choose to be on the latest (untagged) version.
Pkg.checkout("LossFunctions")
Overview
Let us take a look at a few examples to get a feeling of how one can use this library. This package is registered in the Julia package ecosystem. Once installed the package can be imported as usual.
using LossFunctions
Typically, the losses we work with in Machine Learning are multivariate functions of two variables, the true target \(y\), which represents the “ground truth” (i.e. correct answer), and the predicted output \(\hat{y}\), which is what our model thinks the truth is. All losses that can be expressed in this way will be referred to as supervised losses. The true targets are often expected to be of a specific set (e.g. \(\{1,-1\}\) in classification), which we will refer to as \(Y\), while the predicted outputs may be any real number. So for our purposes we can define a supervised loss as follows:
\[L : Y \times \mathbb{R} \rightarrow [0,\infty)\]
Such a loss function takes these two variables as input and returns a value that quantifies how “bad” our prediction is in comparison to the truth. In other words: the lower the loss, the better the prediction.
From an implementation perspective, we should point out that all the concrete loss “functions” that this package provides are actually defined as immutable types, instead of native Julia functions. We can compute the value of some type of loss using the function value(). Let us start with an example of how to compute the loss of a single observation (i.e. two numbers).
# loss y ŷ
julia> value(L2DistLoss(), 1.0, 0.5)
0.25
Calling the same function using arrays instead of numbers will return the element-wise results, and thus basically just serves as a wrapper for broadcast (which by the way is also supported).
julia> true_targets = [ 1, 0, -2];
julia> pred_outputs = [0.5, 2, -1];
julia> value(L2DistLoss(), true_targets, pred_outputs)
3-element Array{Float64,1}:
0.25
4.0
1.0
Alternatively, one can also use an instance of a loss just like one would use any other Julia function. This can make the code significantly more readable while not impacting performance, as it is a zero-cost abstraction (i.e. it compiles down to the same code).
julia> loss = L2DistLoss()
LossFunctions.LPDistLoss{2}()
julia> loss(true_targets, pred_outputs) # same result as above
3-element Array{Float64,1}:
0.25
4.0
1.0
julia> loss(1, 0.5f0) # single observation
0.25f0
If you are not actually interested in the element-wise results individually, but some accumulation of those (such as mean or sum), you can additionally specify an average mode. This will avoid allocating a temporary array and directly compute the result.
julia> value(L2DistLoss(), true_targets, pred_outputs, AvgMode.Sum())
5.25
julia> value(L2DistLoss(), true_targets, pred_outputs, AvgMode.Mean())
1.75
Aside from these standard unweighted average modes, we also provide weighted alternatives. These expect a weight factor for each observation in the predicted outputs and thus allow you to give certain observations a stronger influence over the result.
julia> value(L2DistLoss(), true_targets, pred_outputs, AvgMode.WeightedSum([2,1,1]))
5.5
julia> value(L2DistLoss(), true_targets, pred_outputs, AvgMode.WeightedMean([2,1,1]))
1.375
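To make the arithmetic behind the weighted modes explicit, here is a minimal sketch in plain Julia that does not use the package. The helper `l2dist` is a hypothetical stand-in for L2DistLoss, and dividing by the sum of the weights is inferred from the results shown above rather than taken from the package source.

```julia
# Hypothetical stand-in for the L2 distance loss: (ŷ - y)^2
l2dist(y, ŷ) = abs2(ŷ - y)

true_targets = [1, 0, -2]
pred_outputs = [0.5, 2, -1]
weights      = [2, 1, 1]

# Element-wise losses: [0.25, 4.0, 1.0]
losses = l2dist.(true_targets, pred_outputs)

# Weighted sum: every loss is scaled by its weight before summing
weighted_sum = sum(weights .* losses)

# Weighted mean: the weighted sum divided by the total weight
weighted_mean = weighted_sum / sum(weights)

println(weighted_sum)   # 5.5
println(weighted_mean)  # 1.375
```

Both numbers agree with the WeightedSum and WeightedMean results above.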
We do not restrict the targets and outputs to be vectors, but instead allow them to be arrays of any arbitrary shape. The shape of an array may or may not have an interpretation that is relevant for computing the loss. Consequently, those methods that don’t require this information can be invoked using the same method signature as before, because the results are simply computed element-wise or accumulated.
julia> A = rand(2,3)
2×3 Array{Float64,2}:
0.0939946 0.97639 0.568107
0.183244 0.854832 0.962534
julia> B = rand(2,3)
2×3 Array{Float64,2}:
0.0538206 0.77055 0.996922
0.598317 0.72043 0.912274
julia> value(L2DistLoss(), A, B)
2×3 Array{Float64,2}:
0.00161395 0.0423701 0.183882
0.172286 0.0180639 0.00252607
julia> value(L2DistLoss(), A, B, AvgMode.Sum())
0.420741920634
These methods even allow arrays of different dimensionality, in which case broadcast is performed. This also applies to computing the sum and mean, in which case we use custom broadcast implementations that avoid allocating a temporary array.
julia> value(L2DistLoss(), rand(2), rand(2,2))
2×2 Array{Float64,2}:
0.228077 0.597212
0.789808 0.311914
julia> value(L2DistLoss(), rand(2), rand(2,2), AvgMode.Sum())
0.0860658081865589
That said, it is possible to explicitly specify which dimension denotes the observations. This is particularly useful for multivariate regression where one could want to accumulate the loss per individual observation.
julia> value(L2DistLoss(), A, B, AvgMode.Sum(), ObsDim.First())
2-element Array{Float64,1}:
0.227866
0.192876
julia> value(L2DistLoss(), A, B, AvgMode.Sum(), ObsDim.Last())
3-element Array{Float64,1}:
0.1739
0.060434
0.186408
julia> value(L2DistLoss(), A, B, AvgMode.WeightedSum([2,1]), ObsDim.First())
0.648608280735
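The effect of the observation dimension can be sketched without the package. The following assumes Julia 1.x (for the `dims` keyword) and uses a hypothetical helper `l2dist` together with small made-up matrices, so that the sums are easy to verify by hand.

```julia
# Hypothetical stand-in for the L2 distance loss: (ŷ - y)^2
l2dist(y, ŷ) = abs2(ŷ - y)

# Two rows and three columns of made-up targets and outputs
A = [1.0 2.0 3.0;
     4.0 5.0 6.0]
B = [1.5 2.0 2.0;
     4.0 7.0 6.0]

losses = l2dist.(A, B)   # 2×3 matrix of element-wise losses

# Observations along the first dimension: one accumulated loss per row
per_obs_first = vec(sum(losses, dims=2))

# Observations along the last dimension: one accumulated loss per column
per_obs_last = vec(sum(losses, dims=1))

println(per_obs_first)  # [1.25, 4.0]
println(per_obs_last)   # [0.25, 4.0, 1.0]
```

As in the ObsDim.First() and ObsDim.Last() results above, a 2×3 input yields two accumulated values in the first case and three in the second.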
All these function signatures of value() also apply for computing the derivatives using deriv() and the second derivatives using deriv2().
julia> true_targets = [ 1, 0, -2];
julia> pred_outputs = [0.5, 2, -1];
julia> deriv(L2DistLoss(), true_targets, pred_outputs)
3-element Array{Float64,1}:
-1.0
4.0
2.0
julia> deriv2(L2DistLoss(), true_targets, pred_outputs)
3-element Array{Float64,1}:
2.0
2.0
2.0
Additionally, we provide mutating versions for the subset of methods that return an array. These have the same function signatures with the only difference of requiring an additional parameter as the first argument. This variable should always be the preallocated array that is to be used as storage.
julia> buffer = zeros(3)
3-element Array{Float64,1}:
0.0
0.0
0.0
julia> deriv!(buffer, L2DistLoss(), true_targets, pred_outputs)
3-element Array{Float64,1}:
-1.0
4.0
2.0
Getting Help
To get help on specific functionality you can either look up the information here, or if you prefer you can make use of Julia’s native doc-system. The following example shows how to get additional information on L1HingeLoss within Julia’s REPL:
?L1HingeLoss
search: L1HingeLoss SmoothedL1HingeLoss
L1HingeLoss <: MarginLoss
The hinge loss linearly penalizes every prediction where the resulting
agreement <= 1 . It is Lipschitz continuous and convex, but not strictly
convex.
L(y, ŷ) = max(0, 1 - y⋅ŷ)
          Lossfunction                       Derivative
   ┌────────────┬────────────┐       ┌────────────┬────────────┐
 3 │'\.                      │     0 │                  ┌------│
   │   ''_                   │       │                  |      │
   │      \.                 │       │                  |      │
   │        '.               │       │                  |      │
 L │          ''_            │    L' │                  |      │
   │             \.          │       │                  |      │
   │               '.        │       │                  |      │
 0 │                ''_______│    -1 │------------------┘      │
   └────────────┴────────────┘       └────────────┴────────────┘
   -2                        2       -2                        2
              y ⋅ ŷ                             y ⋅ ŷ
If you find yourself stuck or have other questions concerning the package, you can find us on Gitter or in the Machine Learning domain on discourse.julialang.org.
If you encounter a bug or would like to participate in the further development of this package, come find us on GitHub.
Introduction and Motivation
If you are new to Machine Learning in Julia, or are simply interested in how and why this package works the way it works, feel free to take a look at the following sections. There we discuss the concepts involved and outline the most important terms and definitions.
Background and Motivation
In this section we will discuss the concept “loss function” in more detail. We will start by introducing some terminology and definitions. However, please note that we won’t attempt to give a complete treatment of loss functions and the math involved, as a book or a lecture would. So this section won’t be a substitute for proper literature on the topic. While we will try to cover all the basics necessary to get a decent intuition of the ideas involved, we do assume basic knowledge about Machine Learning.
Warning
This section and its sub-sections serve solely to explain the underlying theory and concepts, and further to motivate the solution provided by this package. As such, this section is not intended as a guide on how to apply this package.
Terminology
To start off, let us go over some basic terminology. In Machine Learning (ML) we are primarily interested in automatically learning meaningful patterns from data. For our purposes it suffices to say that in ML we try to teach the computer to solve a task by induction rather than by definition. This package is primarily concerned with the subset of Machine Learning that falls under the umbrella of Supervised Learning. There we are interested in teaching the computer to predict a specific output for some given input. In contrast to unsupervised learning the teaching process here involves showing the computer what the predicted output is supposed to be; i.e. the “true answer” if you will.
How is this relevant for this package? Well, it implies that we require some meaningful way to show the true answers to the computer so that it can learn from “seeing” them. More importantly, we have to somehow put the true answer into relation to what the computer currently predicts the answer should be. This would provide the basic information needed for the computer to be able to improve; that is what loss functions are for.
When we say we want our computer to learn something that is able to make predictions, we are talking about a prediction function, denoted as \(h\) and sometimes called “fitted hypothesis”, or “fitted model”. Note that we will avoid the term hypothesis for the simple reason that it is widely used in statistics for something completely different. We don’t consider a prediction function as the same thing as a prediction model, because we think of a prediction model as a family of prediction functions. What that boils down to is that the prediction model represents the set of possible prediction functions, while the final prediction function is the chosen function that best solves the problem. So in a way a prediction model can be thought of as the manifestation of our assumptions about the problem, because it restricts the solution to a specific family of functions. For example a linear prediction model for two features represents all possible linear functions that have two coefficients. A prediction function would in that scenario be a concrete linear function with a particular fixed set of coefficients.
The purpose of a prediction function is to take some input and produce a corresponding output. That output should be as faithful as possible to the true answer. In the context of this package we will refer to the “true answer” as the true target, or “target” for short. During training, and only during training, inputs and targets can both be considered as part of our data set. We say “only during training” because in a production setting we don’t actually have the targets available to us (otherwise there would be no prediction problem to solve in the first place). In essence we can think of our data as two entities with a 1-to-1 connection in each observation: the inputs, which we call features, and the corresponding desired outputs, which we call true targets.
Let us be a little more concrete with the two terms we really care about in this package.
- True Targets
A true target (singular) represents the “desired” output for the input features of the observation. The targets are often referred to as “ground truth” and we will denote a single target as \(y \in Y\). When we talk about an array (e.g. a vector) of targets, we will print it in bold as \(\mathbf{y}\). What the set \(Y\) is will depend on the subdomain of supervised learning that you are working in.
- Real-valued Regression: \(Y \subseteq \mathbb{R}\).
- Multioutput Regression: \(Y \subseteq \mathbb{R}^k\).
- Margin-based Classification: \(Y = \{1,-1\}\).
- Probabilistic Classification: \(Y = \{1,0\}\).
- Multiclass Classification: \(Y = \{1,2,\dots,k\}\).
See MLLabelUtils for more information on classification targets.
- Predicted Outputs
A predicted output (singular) is the result of our prediction function given the features of some observation. We will denote a single output as \(\hat{y} \in \mathbb{R}\) (pronounced as “why hat”). When we talk about an array of outputs, we will print it in bold as \(\mathbf{\hat{y}}\). Note something unintuitive but important: The variables \(y\) and \(\hat{y}\) don’t have to be of the same set. Even in a classification setting where \(y \in \{1,-1\}\), it is typical that \(\hat{y} \in \mathbb{R}\).
The fact that in classification the predictions can be fundamentally different from the targets is important to know. The reason for restricting the targets to specific numbers when doing classification is mathematical convenience for loss functions. So loss functions have this knowledge built in.
In a classification setting, the predicted outputs and the true targets are usually of different form and type. For example, in margin-based classification it could be the case that the target \(y=-1\) and the predicted output \(\hat{y} = -1000\). It would seem that the prediction is not really reflecting the target properly, but in this case we would actually have a perfectly correct prediction. This is because in margin-based classification the main thing that matters about the predicted output is that the sign agrees with the true target.
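The role of the sign can be illustrated with a few lines of plain Julia. This is only a sketch: `agreement`, `hinge`, and `iscorrect` are hypothetical helper names, and the hinge formula is the one shown in the L1HingeLoss help text earlier in this documentation.

```julia
# Agreement between a target y ∈ {1, -1} and a real-valued output ŷ
agreement(y, ŷ) = y * ŷ

# Hinge loss as documented for L1HingeLoss: max(0, 1 - y⋅ŷ)
hinge(y, ŷ) = max(0, 1 - agreement(y, ŷ))

# A prediction counts as correct when its sign matches the target
iscorrect(y, ŷ) = agreement(y, ŷ) > 0

println(iscorrect(-1, -1000))  # true
println(hinge(-1, -1000))      # 0
println(hinge(-1, 0.5))        # 1.5
```

The output \(\hat{y} = -1000\) is far away from the target \(y = -1\) as a number, yet it incurs no hinge loss, while \(\hat{y} = 0.5\) has the wrong sign and is penalized.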
Definitions
We base most of our definitions on the work presented in [STEINWART2008]. Note, however, that we will adapt or simplify in places at our discretion. We do this in situations where it makes sense to us considering the scope of this package or because of implementation details.
Let us again consider the term prediction function. More formally, a prediction function \(h\) is a function that maps an input from the feature space \(X\) to the real numbers \(\mathbb{R}\). So invoking \(h\) with some features \(x \in X\) will produce the prediction \(\hat{y} \in \mathbb{R}\).
This resulting prediction \(\hat{y}\) is what we want to compare to the target \(y\) in order to assess how bad the prediction is. The function we use for such an assessment will be of a family of functions we refer to as supervised losses. We think of a supervised loss as a function of two parameters, the true target \(y \in Y\) and the predicted output \(\hat{y} \in \mathbb{R}\). The result of computing such a loss will be a non-negative real number. The larger the value of the loss, the worse the prediction.
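Written compactly, and using the same signature that the API documentation gives for value(), the two definitions above can be summarized as follows. This is merely a restatement of the prose in symbols.

\[h : X \rightarrow \mathbb{R}, \qquad \hat{y} = h(x), \qquad L : Y \times \mathbb{R} \rightarrow [0,\infty)\]

The quantity we assess for an observation \((x, y)\) is therefore \(L(y, \hat{y}) = L(y, h(x))\).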
Note a few interesting things about supervised loss functions.
- The absolute value of a loss is often (but not always) meaningless and doesn’t offer itself to a useful interpretation. What we usually care about is that the loss is as small as it can be.
- In general the loss function we use is not the function we are actually interested in minimizing. Instead we are minimizing what is referred to as a “surrogate”. For binary classification for example we are really interested in minimizing the ZeroOne loss (which simply counts the number of misclassified predictions). However, that loss is difficult to minimize given that it is neither convex nor continuous. That is why we use other loss functions, such as the hinge loss or logistic loss. Those losses are “classification calibrated”, which basically means they are good enough surrogates to solve the same problem. Additionally, surrogate losses tend to have other nice properties.
- For classification it does not need to be the case that a “correct” prediction has a loss of zero. In fact some classification calibrated losses are never truly zero.
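The points in this list can be made tangible with a small sketch in plain Julia. The three helpers below are hypothetical and written as functions of the agreement \(a = y \cdot \hat{y}\); they use the textbook formulas for the zero-one, hinge, and logistic losses and are not taken from the package source.

```julia
# Losses written as functions of the agreement a = y⋅ŷ (plain Julia sketch)
zeroone(a)  = a < 0 ? 1.0 : 0.0    # counts a misclassification
hinge(a)    = max(0.0, 1.0 - a)    # max(0, 1 - a)
logistic(a) = log1p(exp(-a))       # log(1 + exp(-a))

for a in (-2.0, 0.5, 1.0, 5.0)
    println("a = ", a,
            "  zero-one = ", zeroone(a),
            "  hinge = ", hinge(a),
            "  logistic = ", logistic(a))
end
```

For the agreement 0.5 the prediction is correct (zero-one loss of 0), yet the hinge loss is still 0.5. The logistic loss stays strictly positive even for the confident agreement 5.0.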
Alternative Viewpoints
While the term “loss function” is usually used in the same context throughout the literature, the specifics differ from one textbook to another. For that reason we would like to mention alternative definitions of what a “loss function” is. Note that we will only give a partial and thus very simplified description of these. Please refer to the listed sources for more specifics.
In [SHALEV2014] the authors consider a loss function as a higher-order function of two parameters, a prediction model and an observation tuple. So in that definition a loss function and the prediction function are tightly coupled. This way of thinking about it makes a lot of sense, considering the process of how a prediction model is usually fit to the data. For gradient descent to do its job it needs the, well, gradient of the empirical risk. This gradient is computed using the chain rule for the inner loss and the prediction model. If one views the loss and the prediction model as one entity, then the gradient can sometimes be simplified immensely. That said, we chose to not follow this school of thought, because from a software-engineering standpoint it made more sense to us to have small modular pieces. So in our implementation the loss functions don’t need to know that prediction functions even exist. This makes the package easier to maintain, test, and reason about. Given Julia’s support for multiple dispatch we don’t even lose the ability to simplify the gradient if need be.
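To illustrate the modular view, here is a minimal sketch of the chain rule for a single observation. Everything in it is hypothetical: `predict` is a simple linear prediction function, `l2dist_deriv` stands in for the derivative of the L2 distance loss, and `l2_gradient` combines the two.

```julia
# Hypothetical linear prediction function with coefficients θ
predict(θ, x) = sum(θ .* x)

# Derivative of the L2 distance loss (ŷ - y)^2 with respect to the output ŷ
l2dist_deriv(y, ŷ) = 2 * (ŷ - y)

# Chain rule: ∂L/∂θ = ∂L/∂ŷ ⋅ ∂ŷ/∂θ, where ∂ŷ/∂θ = x for a linear model.
# The loss derivative never needs to know what the model looks like.
function l2_gradient(θ, x, y)
    ŷ = predict(θ, x)
    return l2dist_deriv(y, ŷ) .* x
end

θ = [0.5, -1.0]
x = [2.0, 1.0]
y = 1.0

println(l2_gradient(θ, x, y))  # [-4.0, -2.0]
```

Only `l2_gradient` knows about both pieces. Swapping in a different loss or a different prediction function changes one line each.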
[SHALEV2014] Shalev-Shwartz, Shai, and Shai Ben-David. “Understanding machine learning: From theory to algorithms”. Cambridge University Press, 2014.
API Documentation
This section gives a more detailed treatment of the exposed functions and their available methods. We will start by describing how to instantiate a loss, as well as the basic interface that all loss functions share.
Working with Losses
Even though they are called loss “functions”, this package implements them as immutable types instead of true Julia functions. There are good reasons for that. For example it allows us to specify the properties of loss functions explicitly (e.g. isconvex(myloss)). It also makes for a more consistent API when it comes to computing the value or the derivative. Some loss functions even have additional parameters that need to be specified, such as the \(\epsilon\) in the case of the \(\epsilon\)-insensitive loss. Here, types allow for member variables to hide that information away from the method signatures.
In order to avoid potential confusion with true Julia functions, we will refer to “loss functions” as “losses” instead. The available losses share a common interface for the most part. This section will provide an overview of the basic functionality that is available for all the different types of losses. We will discuss how to create a loss, how to compute its value and derivative, and how to query its properties.
Instantiating a Loss
Losses are immutable types. As such, one has to instantiate one in order to work with it. For most losses, the constructors do not expect any parameters.
julia> L2DistLoss()
LossFunctions.LPDistLoss{2}()
julia> HingeLoss()
LossFunctions.L1HingeLoss()
We just said that we need to instantiate a loss in order to work with it. One could be inclined to believe that it would be more memory-efficient to “pre-allocate” a loss when using it in more than one place.
julia> loss = L2DistLoss()
LossFunctions.LPDistLoss{2}()
julia> value(loss, 2, 3)
1
However, that is a common oversimplification. Because all losses are immutable types, they can live on the stack and thus do not come with a heap-allocation overhead.
Even more interesting in the example above is that for losses such as L2DistLoss, which do not have any constructor parameters or member variables, there is no additional code executed at all. Such singletons are only used for dispatch and don’t even produce any additional code, which you can observe for yourself in the code below. As such they are zero-cost abstractions.
julia> v1(loss,t,y) = value(loss,t,y)
julia> v2(t,y) = value(L2DistLoss(),t,y)
julia> @code_llvm v1(loss, 2, 3)
define i64 @julia_v1_70944(i64, i64) #0 {
top:
%2 = sub i64 %1, %0
%3 = mul i64 %2, %2
ret i64 %3
}
julia> @code_llvm v2(2, 3)
define i64 @julia_v2_70949(i64, i64) #0 {
top:
%2 = sub i64 %1, %0
%3 = mul i64 %2, %2
ret i64 %3
}
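The same behaviour can be observed with a user-defined type. The following sketch assumes Julia 1.x and defines a hypothetical field-less struct `MySquaredLoss`; it is not part of the package and only mirrors the idea of a parameter-free loss.

```julia
# Hypothetical parameter-free loss, defined as an immutable struct without fields
struct MySquaredLoss end

# Making the instance callable lets it be used like an ordinary function
(::MySquaredLoss)(y, ŷ) = abs2(ŷ - y)

loss = MySquaredLoss()

println(loss(2, 3))                 # 1
println(sizeof(MySquaredLoss))      # 0, the instance carries no data
println(isbitstype(MySquaredLoss))  # true, so no heap allocation is needed
println(MySquaredLoss() === loss)   # true, all instances are identical
```

Because every instance is indistinguishable from any other, creating a new one at each call site costs nothing compared to reusing a stored instance.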
On the other hand, some types of losses are actually more comparable to whole families of losses instead of just a single one. For example, the immutable type L1EpsilonInsLoss has a free parameter \(\epsilon\). Each concrete \(\epsilon\) results in a different concrete loss of the same family of epsilon-insensitive losses.
julia> L1EpsilonInsLoss(0.5)
LossFunctions.L1EpsilonInsLoss{Float64}(0.5)
julia> L1EpsilonInsLoss(1)
LossFunctions.L1EpsilonInsLoss{Float64}(1.0)
For such losses that do have parameters, it can make a slight difference to pre-instantiate a loss. While they will live on the stack, the constructor usually performs some assertions and conversion for the given parameter. This can come at a slight overhead. At the very least it will not produce the same exact code when pre-instantiated. Still, the fact that they are immutable makes them very efficient abstractions with little to no performance overhead, and zero memory allocations on the heap.
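To see where that overhead could come from, consider the following sketch of a parameterized loss type. The type `MyEpsilonInsLoss`, its positivity check, and the conversion to floating point are illustrative assumptions and not the package’s actual implementation. The formula \(\max(0, |\hat{y} - y| - \epsilon)\) is the usual \(\epsilon\)-insensitive loss.

```julia
# Hypothetical ε-insensitive loss; the struct and its check are illustrative
struct MyEpsilonInsLoss{T<:AbstractFloat}
    ε::T
    function MyEpsilonInsLoss{T}(ε::T) where {T<:AbstractFloat}
        ε > 0 || throw(ArgumentError("ε must be strictly positive"))
        new{T}(ε)
    end
end

# Outer constructor: converts any real number to a floating point parameter
MyEpsilonInsLoss(ε::Real) = MyEpsilonInsLoss{typeof(float(ε))}(float(ε))

# Deviations smaller than ε are not penalized at all
(l::MyEpsilonInsLoss)(y, ŷ) = max(zero(l.ε), abs(ŷ - y) - l.ε)

loss = MyEpsilonInsLoss(1)   # the integer 1 is converted to 1.0
println(loss.ε)              # 1.0
println(loss(0.0, 0.5))      # 0.0
println(loss(0.0, 3.0))      # 2.0
```

The check and the conversion run every time the constructor is called, which is the kind of small cost that pre-instantiating the loss avoids.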
Computing the Values
The first thing we may want to do is compute the loss for some observation (singular). In fact, all losses are implemented on single observations under the hood. The core function to compute the value of a loss is value(). We will see throughout the documentation that this function allows for a lot of different method signatures to accomplish a variety of tasks.
value(loss, target, output) → Number
Computes the result for the loss-function denoted by the parameter loss. Note that target and output can be of different numeric type, in which case promotion is performed in the manner appropriate for the given loss.
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
\[L : Y \times \mathbb{R} \rightarrow [0,\infty)\]
Parameters:
- loss (SupervisedLoss) – The loss-function \(L\) we want to compute the value with.
- target (Number) – The ground truth \(y \in Y\) of the observation.
- output (Number) – The predicted output \(\hat{y} \in \mathbb{R}\) for the observation.
Returns: The (non-negative) numeric result of the loss-function for the given parameters.
# loss y ŷ
julia> value(L1DistLoss(), 1.0, 2.0)
1.0
julia> value(L1DistLoss(), 1, 2)
1
julia> value(L1HingeLoss(), -1, 2)
3
julia> value(L1HingeLoss(), -1f0, 2f0)
3.0f0
It may be interesting to note that this function also supports broadcasting and all the syntax benefits that come with it. Thus, it is quite simple to make use of preallocated memory for storing the element-wise results.
julia> value.(L1DistLoss(), [1,2,3], [2,5,-2])
3-element Array{Int64,1}:
1
3
5
julia> buffer = zeros(3); # preallocate a buffer
julia> buffer .= value.(L1DistLoss(), [1.,2,3], [2,5,-2])
3-element Array{Float64,1}:
1.0
3.0
5.0
Furthermore, with the loop fusion changes that were introduced in Julia 0.6, one can also easily weight the influence of each observation without allocating a temporary array.
julia> buffer .= value.(L1DistLoss(), [1.,2,3], [2,5,-2]) .* [2,1,0.5]
3-element Array{Float64,1}:
2.0
3.0
2.5
Even though broadcasting is supported, we do expose a vectorized method natively. This is done mainly for API consistency reasons. Internally it even uses broadcast itself, but it does provide the additional benefit of a more reliable type-inference.
value(loss, targets, outputs) → Array
Computes the value of the loss function for each index-pair in targets and outputs individually and returns the result as an array of the appropriate size.
In the case that the two parameters are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
Parameters:
- loss (SupervisedLoss) – The loss-function we want to compute the values for.
- targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
- outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
Returns: The element-wise results of the loss function for all values in targets and outputs.
julia> value(L1DistLoss(), [1,2,3], [2,5,-2])
3-element Array{Int64,1}:
1
3
5
julia> value(L1DistLoss(), [1.,2,3], [2,5,-2])
3-element Array{Float64,1}:
1.0
3.0
5.0
We also provide a mutating version for the same reasons. It even utilizes broadcast! underneath.
value!(buffer, loss, targets, outputs)
Computes the value of the loss function for each index-pair in targets and outputs individually, and stores them in the preallocated buffer, which has to be of the appropriate size.
In the case that the two parameters, targets and outputs, are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
Parameters:
- buffer (AbstractArray) – Array to store the computed values in. Old values will be overwritten and lost.
- loss (SupervisedLoss) – The loss-function we want to compute the values for.
- targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
- outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
Returns: buffer (for convenience).
julia> buffer = zeros(3); # preallocate a buffer
julia> value!(buffer, L1DistLoss(), [1.,2,3], [2,5,-2])
3-element Array{Float64,1}:
1.0
3.0
5.0
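The idea behind such a mutating method can be sketched with Julia’s own broadcast! function. The names `l1dist` and `myvalue!` are hypothetical; this is not the package’s implementation, only an illustration of filling a preallocated buffer in place.

```julia
# Hypothetical stand-in for the absolute distance loss: |ŷ - y|
l1dist(y, ŷ) = abs(ŷ - y)

# Sketch of a mutating method: fill the buffer in place and return it
function myvalue!(buffer, targets, outputs)
    broadcast!(l1dist, buffer, targets, outputs)
    return buffer
end

buffer = zeros(3)
result = myvalue!(buffer, [1.0, 2, 3], [2, 5, -2])

println(result)             # [1.0, 3.0, 5.0]
println(result === buffer)  # true, no new array was created
```

Returning the buffer is a convenience only; the caller already holds a reference to the same array.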
Computing the 1st Derivatives
Perhaps the more interesting aspect of loss functions is their derivatives. In fact, most of the popular learning algorithms in Supervised Learning, such as gradient descent, utilize the derivatives of the loss in one way or another during the training process.
To compute the derivative of some loss we expose the function deriv(). It supports the exact same method signatures as value(). It may be interesting to note explicitly that we always compute the derivative with respect to the predicted output, since we are interested in deducing in which direction the output should change.
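What “derivative with respect to the predicted output” means can be checked numerically. The sketch below uses hypothetical helpers (`l2dist`, `l2dist_deriv`, `finite_diff`) and compares the analytic derivative of the L2 distance loss, \(2(\hat{y} - y)\), against a central finite difference in \(\hat{y}\) while the target stays fixed.

```julia
# Hypothetical stand-ins for the L2 distance loss and its derivative in ŷ
l2dist(y, ŷ)       = abs2(ŷ - y)
l2dist_deriv(y, ŷ) = 2 * (ŷ - y)

# Central finite difference in the output ŷ, with the target y held fixed
finite_diff(f, y, ŷ; h = 1e-6) = (f(y, ŷ + h) - f(y, ŷ - h)) / (2h)

y, ŷ = 1.0, 2.0
analytic  = l2dist_deriv(y, ŷ)
numerical = finite_diff(l2dist, y, ŷ)

println(analytic)                                  # 2.0
println(isapprox(analytic, numerical; atol=1e-6))  # true
```

The positive sign tells us that increasing the output would increase the loss, so the output should move downwards, towards the target.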
deriv(loss, target, output) → Number
Computes the derivative for the loss-function denoted by the parameter loss with respect to the output. Note that target and output can be of different numeric type, in which case promotion is performed in the manner appropriate for the given loss.
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
Parameters:
- loss (SupervisedLoss) – The loss-function \(L\) we want to compute the derivative with.
- target (Number) – The ground truth \(y \in Y\) of the observation.
- output (Number) – The predicted output \(\hat{y} \in \mathbb{R}\) for the observation.
Returns: The derivative of the loss-function for the given parameters.
# loss y ŷ
julia> deriv(L2DistLoss(), 1.0, 2.0)
2.0
julia> deriv(L2DistLoss(), 1, 2)
2
julia> deriv(L2HingeLoss(), -1, 2)
6
julia> deriv(L2HingeLoss(), -1f0, 2f0)
6.0f0
Similar to value(), this function also supports broadcasting and all the syntax benefits that come with it. Thus, one can make use of preallocated memory for storing the element-wise derivatives.
julia> deriv.(L2DistLoss(), [1,2,3], [2,5,-2])
3-element Array{Int64,1}:
2
6
-10
julia> buffer = zeros(3); # preallocate a buffer
julia> buffer .= deriv.(L2DistLoss(), [1.,2,3], [2,5,-2])
3-element Array{Float64,1}:
2.0
6.0
-10.0
Furthermore, with the loop fusion changes that were introduced in Julia 0.6, one can also easily weight the influence of each observation without allocating a temporary array.
julia> buffer .= deriv.(L2DistLoss(), [1.,2,3], [2,5,-2]) .* [2,1,0.5]
3-element Array{Float64,1}:
4.0
6.0
-5.0
While broadcast is supported, we do expose a vectorized method natively. This is done mainly for API consistency reasons. Internally it even uses broadcast itself, but it does provide the additional benefit of a more reliable type-inference.
deriv(loss, targets, outputs) → Array
Computes the derivative of the loss function with respect to the output for each index-pair in targets and outputs individually and returns the result as an array of the appropriate size.
In the case that the two parameters are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
Parameters:
- loss (SupervisedLoss) – The loss-function we want to compute the derivative for.
- targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
- outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
Returns: The element-wise derivatives of the loss function for all elements in targets and outputs.
julia> deriv(L2DistLoss(), [1,2,3], [2,5,-2])
3-element Array{Int64,1}:
2
6
-10
julia> deriv(L2DistLoss(), [1.,2,3], [2,5,-2])
3-element Array{Float64,1}:
2.0
6.0
-10.0
We also provide a mutating version for the same reasons. It even utilizes broadcast! underneath.
deriv!(buffer, loss, targets, outputs)
Computes the derivatives of the loss function with respect to the outputs for each index-pair in targets and outputs individually, and stores them in the preallocated buffer, which has to be of the appropriate size.
In the case that the two parameters targets and outputs are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
Parameters:
- buffer (AbstractArray) – Array to store the computed derivatives in. Old values will be overwritten and lost.
- loss (SupervisedLoss) – The loss-function we want to compute the derivatives for.
- targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
- outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
Returns: buffer (for convenience).
julia> buffer = zeros(3); # preallocate a buffer
julia> deriv!(buffer, L2DistLoss(), [1.,2,3], [2,5,-2])
3-element Array{Float64,1}:
2.0
6.0
-10.0
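To make the relation between the allocating and the mutating version more concrete, here is a minimal sketch in plain Julia. The function `l2_deriv` is a hypothetical stand-in for the derivative of a squared distance loss; it is not the package's implementation and only mirrors the numbers shown above.

```julia
# Hypothetical stand-in: d/dŷ (ŷ - y)^2 = 2(ŷ - y).
l2_deriv(y, ŷ) = 2 * (ŷ - y)

targets = [1.0, 2.0, 3.0]
outputs = [2, 5, -2]

# Allocating version: broadcasting creates a new result array.
derivs = l2_deriv.(targets, outputs)

# Mutating version: broadcast! writes into a preallocated buffer.
buffer = zeros(3)
broadcast!(l2_deriv, buffer, targets, outputs)
```

Both calls produce the same numbers; the second one simply reuses memory that already exists.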
It is also possible to compute the value and derivative at the same time. For some losses that means less computation overhead.
-
value_deriv(loss, target, output) → Tuple¶
Returns the results of value() and deriv() as a tuple. In some cases this function can yield better performance, because the losses can make use of shared variables when computing the results. Note that target and output can be of different numeric type, in which case promotion is performed in the manner appropriate for the given loss.
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
Parameters: - loss (SupervisedLoss) – The loss-function we are working with.
- target (Number) – The ground truth \(y \in Y\) of the observation.
- output (Number) – The predicted output \(\hat{y} \in \mathbb{R}\) for the observation.
Returns: The value and the derivative of the loss-function for the given parameters. They are returned as a Tuple in which the first element is the value and the second element the derivative.
# loss y ŷ
julia> value_deriv(L2DistLoss(), -1.0, 3.0)
(16.0,8.0)
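As an illustration of what "shared variables" can mean here, consider the following sketch. The helper `l2_value_deriv` is hypothetical and assumes a squared distance loss; the point is only that the residual is computed once and reused for both results.

```julia
# Hypothetical helper: value and derivative of (ŷ - y)^2.
function l2_value_deriv(y, ŷ)
    r = ŷ - y              # shared intermediate result
    return (r * r, 2 * r)  # (value, derivative with respect to ŷ)
end

result = l2_value_deriv(-1.0, 3.0)  # (16.0, 8.0)
```

For such a simple loss the saving is negligible, but losses that involve an `exp` or `log` may benefit more noticeably.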
Computing the 2nd Derivatives¶
In addition to the first derivative, we also provide the corresponding methods for the second derivative through the function deriv2(). Note again that we always compute the derivative with respect to the predicted output.
-
deriv2(loss, target, output) → Number¶
Computes the second derivative for the loss-function denoted by the parameter loss with respect to the output. Note that target and output can be of different numeric type, in which case promotion is performed in the manner appropriate for the given loss.
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
Parameters: - loss (SupervisedLoss) – The loss-function \(L\) we want to compute the second derivative with.
- target (Number) – The ground truth \(y \in Y\) of the observation.
- output (Number) – The predicted output \(\hat{y} \in \mathbb{R}\) for the observation.
Returns: The second derivative of the loss-function for the given parameters.
# loss y ŷ
julia> deriv2(LogitDistLoss(), -0.5, 0.3)
0.42781939304058886
julia> deriv2(LogitMarginLoss(), -1f0, 2f0)
0.104993574f0
Just like deriv() and value(), this function also supports broadcasting and all the syntax benefits that come with it. Thus, one can make use of preallocated memory for storing the element-wise derivatives.
julia> deriv2.(LogitDistLoss(), [-0.5, 1.2, 3], [0.3, 2.3, -2])
3-element Array{Float64,1}:
0.427819
0.37474
0.0132961
julia> buffer = zeros(3); # preallocate a buffer
julia> buffer .= deriv2.(LogitDistLoss(), [-0.5, 1.2, 3], [0.3, 2.3, -2])
3-element Array{Float64,1}:
0.427819
0.37474
0.0132961
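The numbers above can be reproduced by hand under one assumption: that the logistic distance loss has the form \(L(r) = -\ln \frac{4 e^r}{(1 + e^r)^2}\) with \(r = \hat{y} - y\), whose second derivative is \(\frac{2 e^r}{(1 + e^r)^2}\). The function `logit_dist_deriv2` below is a hypothetical sketch of that closed form, not the package's code.

```julia
# Assumed closed form of the second derivative (see the lead-in above).
function logit_dist_deriv2(y, ŷ)
    er = exp(ŷ - y)
    return 2 * er / (1 + er)^2
end

targets = [-0.5, 1.2, 3.0]
outputs = [0.3, 2.3, -2.0]

buffer = zeros(3)
# The dotted assignment fuses the broadcast and writes in place.
buffer .= logit_dist_deriv2.(targets, outputs)
```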
Furthermore, deriv2() supports all the same method signatures as deriv() does. So to avoid repeating the same text over and over again, please look at the documentation of deriv() for more information.
Function Closures¶
In some circumstances it may be convenient to have the loss function or its derivative as a proper Julia function. Instead of exporting special function names for every implemented loss (like l2distloss(...)), we provide the ability to generate a true function on the fly for any given loss.
-
value_fun(loss) → Function¶
Returns a new function that computes the value() for the given loss. This new function will support all the signatures that value() does.
Parameters: loss (Loss) – The loss we want the function for.
julia> f = value_fun(L2DistLoss())
(::_value) (generic function with 1 method)
julia> f(-1.0, 3.0) # computes the value of L2DistLoss
16.0
julia> f.([1.,2], [4,7])
2-element Array{Float64,1}:
9.0
25.0
-
deriv_fun(loss) → Function¶
Returns a new function that computes the deriv() for the given loss. This new function will support all the signatures that deriv() does.
Parameters: loss (Loss) – The loss we want the derivative-function for.
julia> g = deriv_fun(L2DistLoss())
(::_deriv) (generic function with 1 method)
julia> g(-1.0, 3.0) # computes the deriv of L2DistLoss
8.0
julia> g.([1.,2], [4,7])
2-element Array{Float64,1}:
6.0
10.0
-
deriv2_fun(loss) → Function¶
Returns a new function that computes the deriv2() (i.e. second derivative) for the given loss. This new function will support all the signatures that deriv2() does.
Parameters: loss (Loss) – The loss we want the second-derivative function for.
julia> g2 = deriv2_fun(L2DistLoss())
(::_deriv2) (generic function with 1 method)
julia> g2(-1.0, 3.0) # computes the second derivative of L2DistLoss
2.0
julia> g2.([1.,2], [4,7])
2-element Array{Float64,1}:
2.0
2.0
-
value_deriv_fun(loss) → Function¶
Returns a new function that computes the value_deriv() for the given loss. This new function will support all the signatures that value_deriv() does.
Parameters: loss (Loss) – The loss we want the function for.
julia> fg = value_deriv_fun(L2DistLoss())
(::_value_deriv) (generic function with 1 method)
julia> fg(-1.0, 3.0) # computes the value and derivative of L2DistLoss
(16.0,8.0)
Note, however, that these closures cause quite an overhead when executed in the global scope. If you want to use them efficiently, either don’t create them in global scope, or make sure that you pass the closure to some other function before it is used. This way the compiler will most likely inline it and it will be a zero cost abstraction.
julia> f = value_fun(L2DistLoss())
(::_value) (generic function with 1 method)
julia> @code_llvm f(-1.0, 3.0)
define %jl_value_t* @julia__value_70960(%jl_value_t*, %jl_value_t**, i32) #0 {
top:
%3 = alloca %jl_value_t**, align 8
store volatile %jl_value_t** %1, %jl_value_t*** %3, align 8
%ptls_i8 = call i8* asm "movq %fs:0, $0;\0Aaddq $$-2672, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #2
[... many more lines of code ...]
%15 = call %jl_value_t* @jl_f__apply(%jl_value_t* null, %jl_value_t** %5, i32 3)
%16 = load i64, i64* %11, align 8
store i64 %16, i64* %9, align 8
ret %jl_value_t* %15
}
julia> foo(t,y) = (f = value_fun(L2DistLoss()); f(t,y))
foo (generic function with 1 method)
julia> @code_llvm foo(-1.0, 3.0)
define double @julia_foo_71242(double, double) #0 {
top:
%2 = fsub double %1, %0
%3 = fmul double %2, %2
ret double %3
}
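The advice about passing the closure to another function can be sketched in plain Julia as follows. The type `MyL2` and the functions `my_value`, `make_value_fun`, and `total_loss` are hypothetical names chosen for illustration; they only mimic the pattern and are not part of the package.

```julia
# Hypothetical loss type and value function, for illustration only.
struct MyL2 end
my_value(::MyL2, y, ŷ) = abs2(ŷ - y)

# Generate a plain two-argument function for a given loss.
make_value_fun(loss) = (y, ŷ) -> my_value(loss, y, ŷ)

# Function barrier: the closure arrives as an argument, so the compiler
# can specialize total_loss on the closure's concrete type.
function total_loss(f, targets, outputs)
    acc = zero(f(first(targets), first(outputs)))
    for (y, ŷ) in zip(targets, outputs)
        acc += f(y, ŷ)
    end
    return acc
end

f = make_value_fun(MyL2())
total = total_loss(f, [1.0, 2.0], [4.0, 7.0])  # 9.0 + 25.0
```

Inside `total_loss` the type of `f` is known at compile time, which is what gives the compiler the chance to inline the call.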
Properties of a Loss¶
In some situations it can be quite useful to assert certain properties about a loss-function. One such scenario could be when implementing an algorithm that requires the loss to be strictly convex or Lipschitz continuous. Note that we will only skim over the definitions in most cases. A good treatment of all of the concepts involved can be found in either [BOYD2004] or [STEINWART2008].
[BOYD2004] | Stephen Boyd and Lieven Vandenberghe. “Convex Optimization”. Cambridge University Press, 2004. |
[STEINWART2008] | Steinwart, Ingo, and Andreas Christmann. “Support vector machines”. Springer Science & Business Media, 2008. |
This package uses functions to represent individual properties of a loss. What follows is a list of the implemented property-functions defined in LearnBase.jl.
-
isconvex(loss) → Bool¶
Returns true if given loss is a convex function. A function \(f : \mathbb{R}^n \rightarrow \mathbb{R}\) is convex if its domain is a convex set and if for all \(x, y\) in that domain and all \(\theta\) with \(0 \leq \theta \leq 1\), we have
\[f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y)\]
Parameters: loss (Loss) – The loss we want to check for convexity.
julia> isconvex(LPDistLoss(0.5))
false
julia> isconvex(ZeroOneLoss())
false
julia> isconvex(L1DistLoss())
true
julia> isconvex(L2DistLoss())
true
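The inequality can also be probed numerically. The sketch below assumes that LPDistLoss(0.5) is represented by \(|r|^{0.5}\) and L2DistLoss by \(r^2\); the helper `satisfies_convexity` is a hypothetical name. A single violated check is enough to rule out convexity, whereas a passing check proves nothing on its own.

```julia
# True if the convexity inequality holds for f at x, y and mixing weight θ.
satisfies_convexity(f, x, y, θ) =
    f(θ * x + (1 - θ) * y) <= θ * f(x) + (1 - θ) * f(y)

sq(r) = r^2                  # assumed representing function of L2DistLoss
sqrt_abs(r) = sqrt(abs(r))   # assumed representing function of LPDistLoss(0.5)

holds_sq = satisfies_convexity(sq, 0.0, 1.0, 0.5)          # 0.25 <= 0.5
holds_sqrt = satisfies_convexity(sqrt_abs, 0.0, 1.0, 0.5)  # 0.7071... > 0.5
```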
-
isstrictlyconvex(loss) → Bool¶
Returns true if given loss is a strictly convex function. A function \(f : \mathbb{R}^n \rightarrow \mathbb{R}\) is strictly convex if its domain is a convex set and if for all \(x, y\) in that domain where \(x \neq y\), and all \(\theta\) with \(0 < \theta < 1\), we have
\[f(\theta x + (1 - \theta) y) < \theta f(x) + (1 - \theta) f(y)\]
Parameters: loss (Loss) – The loss we want to check for strict convexity.
julia> isstrictlyconvex(L1DistLoss())
false
julia> isstrictlyconvex(LogitDistLoss())
true
julia> isstrictlyconvex(L2DistLoss())
true
-
isstronglyconvex(loss) → Bool¶
Returns true if given loss is a strongly convex function. A function \(f : \mathbb{R}^n \rightarrow \mathbb{R}\) is \(m\)-strongly convex if its domain is a convex set and if for all \(x, y \in\) dom \(f\) where \(x \neq y\), and all \(\theta\) with \(0 \le \theta \le 1\), we have
\[f(\theta x + (1 - \theta)y) \le \theta f(x) + (1 - \theta) f(y) - 0.5 m \cdot \theta (1 - \theta) {\| x - y \|}_2^2\]
In a more familiar setting, if the loss function is differentiable we have
\[\left( \nabla f(x) - \nabla f(y) \right)^\top (x - y) \ge m {\| x - y\|}_2^2\]
Parameters: loss (Loss) – The loss we want to check for strong convexity.
julia> isstronglyconvex(L1DistLoss())
false
julia> isstronglyconvex(LogitDistLoss())
false
julia> isstronglyconvex(L2DistLoss())
true
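As a worked example of the differentiable characterization, assume the representing function \(f(r) = r^2\) for the squared distance loss and \(f(r) = |r|\) for the absolute distance loss. The following lines are a sketch of why the former is strongly convex while the latter is not.

```latex
% Squared distance: f(r) = r^2, so f'(r) = 2r and
\[\left( f'(x) - f'(y) \right)(x - y) = 2 (x - y)^2 \ge m (x - y)^2 \quad \text{for } m = 2.\]
% Absolute distance: f(r) = |r|, so f'(r) = 1 for all r > 0. For any x, y > 0
\[\left( f'(x) - f'(y) \right)(x - y) = 0,\]
% which cannot be bounded from below by m (x - y)^2 with m > 0 when x \neq y.
```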
-
isdifferentiable(loss[, at]) → Bool¶
Returns true if given loss is differentiable (optionally only at the given point if at is specified). A function \(f : \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}\) is differentiable at a point \(x \in\) int dom \(f\) if there exists a matrix \(Df(x)\) in \(\mathbb{R}^{m \times n}\) such that it satisfies:
\[\lim_{z \neq x, z \to x} \frac{{\|f(z) - f(x) - Df(x)(z-x)\|}_2}{{\|z - x\|}_2} = 0\]
A function is differentiable if its domain is open and it is differentiable at every point \(x\).
Parameters: - loss (Loss) – The loss we want to check for differentiability.
- at (Number) – Optional. The point x at which differentiability should be checked.
julia> isdifferentiable(L1DistLoss())
false
julia> isdifferentiable(L1DistLoss(), 1)
true
julia> isdifferentiable(L2DistLoss())
true
-
istwicedifferentiable(loss[, at]) → Bool¶
Returns true if given loss is a twice differentiable function (optionally only at the given point if at is specified). A function \(f : \mathbb{R}^{n} \rightarrow \mathbb{R}\) is said to be twice differentiable at a point \(x \in\) int dom \(f\) if the derivative of \(\nabla f\) exists at \(x\):
\[\nabla^2 f(x) = D \nabla f(x)\]
A function is twice differentiable if its domain is open and it is twice differentiable at every point \(x\).
Parameters: - loss (Loss) – The loss we want to check for twice differentiability.
- at (Number) – Optional. The point x at which twice differentiability should be checked.
julia> istwicedifferentiable(L1DistLoss())
false
julia> istwicedifferentiable(L2DistLoss())
true
-
isnemitski(loss) → Bool¶
Returns true if given loss is a Nemitski loss function.
We call a supervised loss function \(L : Y \times \mathbb{R} \rightarrow [0,\infty)\) a Nemitski loss if there exist a measurable function \(b : Y \rightarrow [0, \infty)\) and an increasing function \(h : [0, \infty) \rightarrow [0, \infty)\) such that
\[L(y,\hat{y}) \le b(y) + h(|\hat{y}|), \qquad (y, \hat{y}) \in Y \times \mathbb{R}.\]
-
islipschitzcont(loss) → Bool¶
Returns true if given loss function is Lipschitz continuous.
A supervised loss function \(L : Y \times \mathbb{R} \rightarrow [0, \infty)\) is Lipschitz continuous if there exists a finite constant \(M < \infty\) such that
\[|L(y, t) - L(y, t')| \le M |t - t'|, \qquad \forall y \in Y, \; t, t' \in \mathbb{R}\]
Parameters: loss (Loss) – The loss we want to check for being Lipschitz continuous.
julia> islipschitzcont(SigmoidLoss())
true
julia> islipschitzcont(ExpLoss())
false
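A rough numerical illustration of the difference between the two results is given below. It assumes the representing functions \(\psi(a) = 1 - \tanh(a)\) for the sigmoid loss and \(\psi(a) = e^{-a}\) for the exponential loss; both are assumptions made for this sketch, and `max_slope` is a hypothetical helper. A finite grid can only hint at the behaviour, it does not prove anything.

```julia
# Assumed representing functions (not taken from the package source).
sigmoid_psi(a) = 1 - tanh(a)
exp_psi(a) = exp(-a)

# Largest absolute slope between neighbouring grid points.
function max_slope(psi, grid)
    return maximum(abs(psi(b) - psi(a)) / (b - a)
                   for (a, b) in zip(grid[1:end-1], grid[2:end]))
end

grid = -10:0.01:10
slope_sigmoid = max_slope(sigmoid_psi, grid)  # stays below 1
slope_exp = max_slope(exp_psi, grid)          # grows as the grid is widened
```

Widening the grid leaves the first number essentially unchanged, while the second one keeps growing, which is consistent with a missing global bound \(M\).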
-
islocallylipschitzcont(loss) → Bool¶
Returns true if given loss function is locally Lipschitz continuous.
A supervised loss \(L : Y \times \mathbb{R} \rightarrow [0, \infty)\) is called locally Lipschitz continuous if \(\forall a \ge 0\) there exists a constant \(c_a \ge 0\) such that
\[\sup_{y \in Y} \left| L(y,t) - L(y,t') \right| \le c_a |t - t'|, \qquad t,t' \in [-a,a]\]
Parameters: loss (Loss) – The loss we want to check for being locally Lipschitz continuous.
julia> islocallylipschitzcont(ExpLoss())
true
julia> islocallylipschitzcont(SigmoidLoss())
true
-
isclipable(loss) → Bool¶
Returns true if given loss function is clipable. A supervised loss \(L : Y \times \mathbb{R} \rightarrow [0, \infty)\) can be clipped at \(M > 0\) if, for all \((y,t) \in Y \times \mathbb{R}\),
\[L(y, \hat{t}) \le L(y, t)\]
where \(\hat{t}\) denotes the clipped value of \(t\) at \(\pm M\). That is
\[\hat{t} = \begin{cases} -M & \quad \text{if } t < -M \\ t & \quad \text{if } t \in [-M, M] \\ M & \quad \text{if } t > M \end{cases}\]
Parameters: loss (Loss) – The loss we want to check for being clipable.
julia> isclipable(ExpLoss())
false
julia> isclipable(L2DistLoss())
true
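The clipping condition can be checked on a few points with the following sketch. The names `clip` and `sq_dist` are hypothetical, the squared distance is an assumed stand-in for a distance-based loss, and the target is assumed to lie within \([-M, M]\).

```julia
# Clip a prediction t to the interval [-M, M].
clip(t, M) = clamp(t, -M, M)

# Assumed stand-in for a squared distance loss.
sq_dist(y, t) = abs2(t - y)

y, M = 0.5, 1.0  # target assumed to lie within [-M, M]

# Clipping never increases the loss on this grid of predictions.
never_worse = all(sq_dist(y, clip(t, M)) <= sq_dist(y, t) for t in -5:0.5:5)
```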
-
ismarginbased(loss) → Bool¶
Returns true if given loss is a margin-based Loss.
A supervised loss function \(L : Y \times \mathbb{R} \rightarrow [0, \infty)\) is said to be margin-based if there exists a representing function \(\psi : \mathbb{R} \rightarrow [0, \infty)\) satisfying
\[L(y, \hat{y}) = \psi (y \cdot \hat{y}), \qquad (y, \hat{y}) \in Y \times \mathbb{R}\]
Parameters: loss (Loss) – The loss we want to check for being margin-based.
julia> ismarginbased(HuberLoss(2))
false
julia> ismarginbased(L2MarginLoss())
true
-
isclasscalibrated(loss) → Bool¶
-
isdistancebased(loss) → Bool¶
Returns true if given loss is a distance-based Loss.
A supervised loss function \(L : Y \times \mathbb{R} \rightarrow [0, \infty)\) is said to be distance-based if there exists a representing function \(\psi : \mathbb{R} \rightarrow [0, \infty)\) satisfying \(\psi (0) = 0\) and
\[L(y, \hat{y}) = \psi (\hat{y} - y), \qquad (y, \hat{y}) \in Y \times \mathbb{R}\]
Parameters: loss (Loss) – The loss we want to check for being distance-based.
julia> isdistancebased(HuberLoss(2))
true
julia> isdistancebased(L2MarginLoss())
false
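The two families differ only in what is handed to the representing function. The sketch below uses hypothetical names and assumes \(\psi(r) = r^2\) for the distance-based case and \(\psi(a) = (1 - a)^2\) for the margin-based case.

```julia
psi_dist(r) = r^2          # assumed representing function, psi_dist(0) == 0
psi_margin(a) = (1 - a)^2  # assumed representing function

dist_loss(y, ŷ) = psi_dist(ŷ - y)      # depends only on the difference ŷ - y
margin_loss(y, ŷ) = psi_margin(y * ŷ)  # depends only on the product y * ŷ

# Shifting target and output by the same amount leaves a distance-based loss unchanged.
same_dist = dist_loss(1.0, 3.0) == dist_loss(11.0, 13.0)

# Flipping the sign of both leaves a margin-based loss unchanged.
same_margin = margin_loss(1.0, 2.0) == margin_loss(-1.0, -2.0)
```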
-
issymmetric(loss) → Bool¶
Returns true if given loss is a Symmetric Loss.
A function \(f : \mathbb{R} \rightarrow [0,\infty)\) is said to be symmetric about the origin if we have
\[f(x) = f(-x), \qquad \forall x \in \mathbb{R}\]
A distance-based loss is said to be symmetric if its representing function is symmetric.
Parameters: loss (Loss) – The loss we want to check for being symmetric.
julia> issymmetric(QuantileLoss(0.2))
false
julia> issymmetric(LPDistLoss(2))
true
Next we will consider how to average or sum the results of the loss functions more efficiently. The methods described here are implemented in such a way as to avoid allocating a temporary array.
Efficient Sum and Mean¶
In many situations we are not really that interested in the individual loss values (or derivatives) of each observation, but the sum or mean of them; be it weighted or unweighted. For example, by computing the unweighted mean of the loss for our training set, we would effectively compute what is known as the empirical risk. This is usually the quantity (or an important part of it) that we are interested in minimizing.
When we say “weighted” or “unweighted”, we are referring to
whether we are explicitly specifying the influence of individual
observations on the result. “Weighing” an observation is achieved
by multiplying its value with some number (i.e. the “weight” of
that observation). As a consequence that weighted observation
will have a stronger or weaker influence on the result.
In order to weigh an observation we have to know which array dimension (if there are more than one) denotes the observations. On the other hand, for computing an unweighted result we don’t actually need to know anything about the meaning of the array dimensions, as long as the targets and the outputs are of compatible shape and size.
The naive way to compute such an unweighted reduction would be to call mean or sum on the result of the element-wise operation. The following code snippet shows an example of that. We say “naive”, because it will not give us acceptable performance.
julia> value(L1DistLoss(), [1.,2,3], [2,5,-2])
3-element Array{Float64,1}:
1.0
3.0
5.0
# WARNING: Bad code
julia> sum(value(L1DistLoss(), [1.,2,3], [2,5,-2]))
9.0
This works as expected, but there is a price for it. Before the sum can be computed, value will allocate a temporary array and fill it with the element-wise results. After that, sum will iterate over this temporary array and accumulate the values accordingly. Bottom line: we allocate temporary memory that we don’t need in the end and could avoid.
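The general idea of avoiding the temporary array can be sketched in plain Julia with a generator. The function `l1` is a hypothetical stand-in for an absolute distance loss, and this sketch is not how the package implements its accumulation.

```julia
# Hypothetical stand-in for an absolute distance loss.
l1(y, ŷ) = abs(ŷ - y)

targets = [1.0, 2.0, 3.0]
outputs = [2, 5, -2]

# Naive: materializes the element-wise results before summing.
naive = sum(l1.(targets, outputs))

# Lazy: the generator feeds each value straight into the accumulation.
lazy = sum(l1(y, ŷ) for (y, ŷ) in zip(targets, outputs))
```

Both variants return the same number; only the first one has to build an intermediate array.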
For that reason we provide special methods that compute the common accumulations efficiently without allocating temporary arrays. These methods can be invoked using an additional parameter which specifies how the values should be accumulated / averaged. The type of this parameter has to be a subtype of AverageMode.
Average Modes¶
Before we discuss these memory-efficient methods, let us briefly introduce the available average mode types. We provide a number of different average modes, all of which are contained within the namespace AvgMode. An instance of such a type can then be used as an additional parameter to value(), deriv(), and deriv2(), as we will see further down.
What follows is a list of the available average modes, each with a short description of its effect when used as an additional parameter to the functions mentioned above.
-
class AvgMode.None¶
Used by default. This will cause the element-wise results to be returned.
-
class AvgMode.Sum¶
Causes the method to return the unweighted sum of the elements instead of the individual elements. Can be used in combination with ObsDim, in which case a vector will be returned containing the sum for each observation (useful mainly for multivariable regression).
-
class AvgMode.Mean¶
Causes the method to return the unweighted mean of the elements instead of the individual elements. Can be used in combination with ObsDim, in which case a vector will be returned containing the mean for each observation (useful mainly for multivariable regression).
-
class AvgMode.WeightedSum¶
Causes the method to return the weighted sum of all observations. The variable weights has to be a vector of the same length as the number of observations. If normalize = true, the values of the weight vector will be normalized in such a way that they sum to one.
-
weights¶
Vector of weight values that can be used to give certain observations a stronger influence on the sum.
julia> AvgMode.WeightedSum([1,1,2]); # 3 observations
-
normalize¶
Boolean that specifies if the weight vector should be transformed in such a way that it sums to one (i.e. normalized). This will not mutate the weight vector but instead happen on the fly during the accumulation.
Defaults to false. Setting it to true only really makes sense in multivalue-regression, otherwise the result will be the same as for WeightedMean.
julia> AvgMode.WeightedSum([1,1,2], normalize = true);
-
class AvgMode.WeightedMean¶
Causes the method to return the weighted mean of all observations. The variable weights has to be a vector of the same length as the number of observations. If normalize = true, the values of the weight vector will be normalized in such a way that they sum to one.
-
weights¶
Vector of weight values that can be used to give certain observations a stronger influence on the mean.
julia> AvgMode.WeightedMean([1,1,2]); # 3 observations
-
normalize¶
Boolean that specifies if the weight vector should be transformed in such a way that it sums to one (i.e. normalized). This will not mutate the weight vector but instead happen on the fly during the accumulation.
Defaults to true. Setting it to false only really makes sense in multivalue-regression, otherwise the result will be the same as for WeightedSum.
julia> AvgMode.WeightedMean([1,1,2], normalize = false);
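To illustrate what normalizing "on the fly" can look like, here is a small sketch in plain Julia. The helper `weighted_sum` is hypothetical and only covers a vector of per-observation losses; it is not the package's implementation and makes no claim about how the modes behave for multi-dimensional arrays.

```julia
# Hypothetical helper; not the package's implementation.
function weighted_sum(losses, weights; normalize::Bool = false)
    length(losses) == length(weights) ||
        throw(DimensionMismatch("expected one weight per observation"))
    total = sum(w * l for (w, l) in zip(weights, losses))
    # Divide by the weight sum instead of rescaling (mutating) the weights.
    return normalize ? total / sum(weights) : total
end

losses = [1.0, 3.0, 5.0]
weights = [1, 1, 2]

plain = weighted_sum(losses, weights)                      # 1 + 3 + 10
normed = weighted_sum(losses, weights, normalize = true)   # divided by 4
```

The weight vector itself is left untouched in both calls.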
Unweighted Sum and Mean¶
As hinted before, we provide special memory efficient methods for computing the sum or the mean of the element-wise (or broadcasted) results of value(), deriv(), and deriv2(). These methods avoid the allocation of a temporary array and instead compute the result directly.
-
value(loss, targets, outputs, avgmode) → Number¶
Computes the unweighted sum or mean (depending on avgmode) of the individual values of the loss function for each pair in targets and outputs. This method will not allocate a temporary array.
In the case that the two parameters are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
Parameters: - loss (SupervisedLoss) – The loss-function we are interested in.
- targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
- outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
- avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
Returns: The unweighted sum or mean of the individual values of the loss function for all values in targets and outputs.
Return type: Number
julia> value(L1DistLoss(), [1,2,3], [2,5,-2], AvgMode.Sum())
9
julia> value(L1DistLoss(), [1.,2,3], [2,5,-2], AvgMode.Sum())
9.0
julia> value(L1DistLoss(), [1,2,3], [2,5,-2], AvgMode.Mean())
3.0
julia> value(L1DistLoss(), Float32[1,2,3], Float32[2,5,-2], AvgMode.Mean())
3.0f0
The exact same method signature is also implemented for deriv() and deriv2() respectively.
-
deriv(loss, targets, outputs, avgmode) → Number¶
Computes the unweighted sum or mean (depending on avgmode) of the individual derivatives of the loss function for each pair in targets and outputs. This method will not allocate a temporary array.
In the case that the two parameters are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
Parameters: - loss (SupervisedLoss) – The loss-function we are interested in.
- targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
- outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
- avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
Returns: The unweighted sum or mean of the individual derivatives of the loss function for all values in targets and outputs.
Return type: Number
julia> deriv(L2DistLoss(), [1,2,3], [2,5,-2], AvgMode.Sum())
-2
julia> deriv(L2DistLoss(), [1,2,3], [2,5,-2], AvgMode.Mean())
-0.6666666666666665
-
deriv2(loss, targets, outputs, avgmode) → Number¶
Computes the unweighted sum or mean (depending on avgmode) of the individual 2nd derivatives of the loss function for each pair in targets and outputs. This method will not allocate a temporary array.
In the case that the two parameters are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
Parameters: - loss (SupervisedLoss) – The loss-function we are interested in.
- targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
- outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
- avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
Returns: The unweighted sum or mean of the individual 2nd derivatives of the loss function for all values in targets and outputs.
Return type: Number
julia> deriv2(LogitDistLoss(), [1.,2,3], [2,5,-2], AvgMode.Sum())
0.49687329928636825
julia> deriv2(LogitDistLoss(), [1.,2,3], [2,5,-2], AvgMode.Mean())
0.1656244330954561
Sum and Mean per Observation¶
When the targets and predicted outputs are multi-dimensional arrays instead of vectors, we may be interested in accumulating the values over all but one dimension. This is typically the case when we work in a multi-variable regression setting, where each observation has multiple outputs and thus multiple targets. In those scenarios we may be more interested in the average loss for each observation, rather than the total average over all the data.
To be able to accumulate the values for each observation separately, we have to know and explicitly specify the dimension that denotes the observations. For that purpose we provide the types contained in the namespace ObsDim.
-
value(loss, targets, outputs, avgmode, obsdim) → Vector
Computes the values of the loss function for each pair in targets and outputs individually, and returns either the unweighted sum or mean for each observation (depending on avgmode). This method will not allocate a temporary array, but it will allocate the resulting vector.
Both arrays have to be of the same shape and size. Furthermore they have to have at least two array dimensions (i.e. they must not be vectors).
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
Parameters: - loss (SupervisedLoss) – The loss-function we are interested in.
- targets (AbstractArray) – The multi-dimensional array of ground truths \(\mathbf{y}\).
- outputs (AbstractArray) – The multi-dimensional array of predicted outputs \(\mathbf{\hat{y}}\).
- avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
- obsdim (ObsDimension) – Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
Returns: A vector that contains the unweighted sums / means of the loss for each observation in targets and outputs.
Return type: Vector
Consider the following two matrices, targets and outputs. There are two ways to interpret the shape of these arrays if one dimension is to denote the observations.
julia> targets = rand(2,4)
2×4 Array{Float64,2}:
0.0743675 0.285303 0.247157 0.223666
0.513145 0.59224 0.32325 0.989964
julia> outputs = rand(2,4)
2×4 Array{Float64,2}:
0.6335 0.319131 0.637087 0.613777
0.513495 0.264587 0.533555 0.714688
The first interpretation would be to say that the first dimension denotes the observations. Thus this data would consist of two observations with four variables each.
julia> value(L1DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.First())
2-element Array{Float64,1}:
1.373
0.813583
julia> value(L1DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.First())
2-element Array{Float64,1}:
0.34325
0.203396
The second possible interpretation would be to say that the second/last dimension denotes the observations. In that case our data consists of four observations with two variables each.
julia> value(L1DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.Last())
4-element Array{Float64,1}:
0.559482
0.36148
0.600235
0.665386
julia> value(L1DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.Last())
4-element Array{Float64,1}:
0.279741
0.18074
0.300118
0.332693
Because this method returns a vector of values, we also provide a mutating version that can make use of a preallocated vector to write the results into.
-
value!(buffer, loss, targets, outputs, avgmode, obsdim) → Vector¶
Computes the values of the loss function for each pair in targets and outputs individually, and returns either the unweighted sum or mean for each observation, depending on avgmode. The results are stored into the given vector buffer. This method will not allocate a temporary array.
Both arrays have to be of the same shape and size. Furthermore they have to have at least two array dimensions (i.e. they must not be vectors).
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
Parameters: - buffer (AbstractVector) – Array to store the computed values in. Old values will be overwritten and lost.
- loss (SupervisedLoss) – The loss-function we are interested in.
- targets (AbstractArray) – The multi-dimensional array of ground truths \(\mathbf{y}\).
- outputs (AbstractArray) – The multi-dimensional array of predicted outputs \(\mathbf{\hat{y}}\).
- avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
- obsdim (ObsDimension) – Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
Returns: buffer (for convenience).
julia> buffer = zeros(2);
julia> value!(buffer, L1DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.First())
2-element Array{Float64,1}:
1.373
0.813583
julia> value!(buffer, L1DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.First())
2-element Array{Float64,1}:
0.34325
0.203396
julia> buffer = zeros(4);
julia> value!(buffer, L1DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.Last())
4-element Array{Float64,1}:
0.559482
0.36148
0.600235
0.665386
julia> value!(buffer, L1DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.Last())
4-element Array{Float64,1}:
0.279741
0.18074
0.300118
0.332693
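How a per-observation sum can be accumulated directly into a buffer is sketched below in plain Julia. The function `l1_sum_per_obs!` is hypothetical, hard-codes the absolute distance, and takes the observation dimension as a plain integer rather than an ObsDim type; it only illustrates the accumulation pattern.

```julia
# Hypothetical helper: sums |ŷ - y| for each observation along obsdim.
function l1_sum_per_obs!(buffer, targets, outputs, obsdim::Int)
    size(targets) == size(outputs) ||
        throw(DimensionMismatch("targets and outputs differ in size"))
    length(buffer) == size(targets, obsdim) ||
        throw(DimensionMismatch("buffer needs one slot per observation"))
    fill!(buffer, 0)
    for I in CartesianIndices(targets)
        # I[obsdim] is the index of the observation this element belongs to.
        buffer[I[obsdim]] += abs(outputs[I] - targets[I])
    end
    return buffer
end

targets = [0.0 1.0 2.0; 3.0 4.0 5.0]
outputs = [1.0 1.0 4.0; 3.0 2.0 5.0]

per_row = l1_sum_per_obs!(zeros(2), targets, outputs, 1)  # rows as observations
per_col = l1_sum_per_obs!(zeros(3), targets, outputs, 2)  # columns as observations
```

Dividing each entry by the number of variables per observation would turn these sums into means.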
We also provide both of these methods for deriv() and deriv2() respectively.
-
deriv(loss, targets, outputs, avgmode, obsdim) → Vector
Same as below, but using the 1st derivative.
-
deriv2(loss, targets, outputs, avgmode, obsdim) → Vector
Computes the (2nd) derivatives of the loss function for each pair in targets and outputs individually and returns either the unweighted sum or mean for each observation (depending on avgmode). This method will not allocate a temporary array, but it will allocate the resulting vector.
Both arrays have to be of the same shape and size. Furthermore they have to have at least two array dimensions (i.e. they must not be vectors).
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
Parameters: - loss (SupervisedLoss) – The loss-function we are interested in.
- targets (AbstractArray) – The multi-dimensional array of ground truths \(\mathbf{y}\).
- outputs (AbstractArray) – The multi-dimensional array of predicted outputs \(\mathbf{\hat{y}}\).
- avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
- obsdim (ObsDimension) – Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
Returns: A vector that contains the unweighted sums / means of the (2nd) loss-derivatives for each observation in targets and outputs.
Return type: Vector
julia> targets = rand(2,4)
2×4 Array{Float64,2}:
0.0743675 0.285303 0.247157 0.223666
0.513145 0.59224 0.32325 0.989964
julia> outputs = rand(2,4)
2×4 Array{Float64,2}:
0.6335 0.319131 0.637087 0.613777
0.513495 0.264587 0.533555 0.714688
julia> deriv(L2DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.First())
2-element Array{Float64,1}:
2.746
-0.784548
julia> deriv(L2DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.First())
2-element Array{Float64,1}:
0.686501
-0.196137
julia> deriv(L2DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.Last())
4-element Array{Float64,1}:
1.11896
-0.58765
1.20047
0.22967
julia> deriv(L2DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.Last())
4-element Array{Float64,1}:
0.559482
-0.293825
0.600235
0.114835
Because this method returns a vector of values, we also provide a mutating version that can make use of a preallocated vector to write the results into.
-
deriv!(buffer, loss, targets, outputs, avgmode, obsdim) → Vector¶
Same as below, but using the 1st derivative.
-
deriv2!
(buffer, loss, targets, outputs, avgmode, obsdim) → Vector¶ Computes the (2nd) derivatives of the loss function for each pair in targets and outputs individually, and returns either the unweighted sums or means for each observation, depending on avgmode. The results are stored into the given vector buffer. This method will not allocate a temporary array.
Both arrays have to be of the same shape and size. Furthermore they have to have at least two array dimensions (i.e. so they must not be vectors).
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
Parameters:
- buffer (AbstractVector) – Array to store the computed values in. Old values will be overwritten and lost.
- loss (SupervisedLoss) – The loss-function we are interested in.
- targets (AbstractArray) – The multi-dimensional array of ground truths \(\mathbf{y}\).
- outputs (AbstractArray) – The multi-dimensional array of predicted outputs \(\mathbf{\hat{y}}\).
- avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean().
- obsdim (ObsDimension) – Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
Returns: buffer (for convenience).
julia> buffer = zeros(2);
julia> deriv!(buffer, L2DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.First())
2-element Array{Float64,1}:
2.746
-0.784548
julia> deriv!(buffer, L2DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.First())
2-element Array{Float64,1}:
0.686501
-0.196137
julia> buffer = zeros(4);
julia> deriv!(buffer, L2DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.Last())
4-element Array{Float64,1}:
1.11896
-0.58765
1.20047
0.22967
julia> deriv!(buffer, L2DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.Last())
4-element Array{Float64,1}:
0.559482
-0.293825
0.600235
0.114835
Weighted Sum and Mean¶
Up to this point, all the averaging was performed in an unweighted manner. That means that each observation was treated as equal and thus had the same potential influence on the result. In this sub-section we will consider the situations in which we do want to explicitly specify the influence of each observation (i.e. we want to weigh them). When we say we “weigh” an observation, what it effectively boils down to is multiplying the result for that observation (i.e. the computed loss or derivative) by some number. This is done for every observation individually.
To get a better understanding of what we are talking about, let us consider performing a weighting scheme manually. The following code will compute the loss for three observations, and then multiply the result of the second observation by the number 2, while the other two remain as they are. If we then sum up the results, we will see that the loss of the second observation was effectively counted twice.
julia> result = value.(L1DistLoss(), [1.,2,3], [2,5,-2]) .* [1,2,1]
3-element Array{Float64,1}:
1.0
6.0
5.0
julia> sum(result)
12.0
The point of weighing observations is to inform the learning algorithm we are working with that it is more important to us to predict some observations correctly than others. So really, the concrete weight-factor matters less than the ratio between the different weights. In the example above the second observation was thus considered twice as important as any of the other two observations.
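To make the point about ratios concrete, here is a small sketch in plain Julia. It does not call the package; the helper name `weighted_mean` is ours and purely illustrative. It reuses the three per-observation losses from the example above and suggests that, once the weights are normalized, scaling all of them by a common factor leaves the result unchanged.

```julia
# Per-observation L1 losses from the example above: |2-1|, |5-2|, |-2-3|.
losses = [1.0, 3.0, 5.0]

# Hypothetical helper (not part of the package): weighted mean with the
# weights normalized so that they sum to one.
weighted_mean(x, w) = sum(x .* (w ./ sum(w)))

a = weighted_mean(losses, [1, 2, 1])   # second observation counts double
b = weighted_mean(losses, [2, 4, 2])   # same ratios, all weights doubled

println(a ≈ b)   # the two results agree, because only the ratios differ
```

Note that this only holds for the normalized case. An unnormalized weighted sum scales with the weights, so `[2, 4, 2]` would double the total compared to `[1, 2, 1]`.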
In the case of multi-dimensional arrays the process isn’t that simple anymore. In such a scenario, computing the weighted sum (or weighted mean) can be thought of as having an additional step. First we either compute the sum or (unweighted) average for each observation (which results in a vector), and then we compute the weighted sum of all observations.
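Written out for the weighted sum, the two steps look roughly as follows. The symbols are introduced here only for illustration and are not notation used elsewhere in this documentation: \(n\) is the number of observations, \(k\) the number of target variables per observation, \(w_i\) the weight of observation \(i\), and \(s_i\) the intermediate per-observation sum.

```latex
s_i = \sum_{j=1}^{k} L(y_{ij}, \hat{y}_{ij}),
\qquad
\text{weighted sum} = \sum_{i=1}^{n} w_i \, s_i
```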
The following code snippet demonstrates how to compute the AvgMode.WeightedSum([2,1]) manually. This is not meant as an example of how to do it, but simply to show what is happening qualitatively. In this example we assume that we are working in a multi-variable regression setting, in which our data set has two observations with four target-variables each (the observations are stored along the first array dimension).
julia> targets = rand(2,4)
2×4 Array{Float64,2}:
0.0743675 0.285303 0.247157 0.223666
0.513145 0.59224 0.32325 0.989964
julia> outputs = rand(2,4)
2×4 Array{Float64,2}:
0.6335 0.319131 0.637087 0.613777
0.513495 0.264587 0.533555 0.714688
# WARNING: BAD CODE - ONLY FOR ILLUSTRATION
julia> tmp = sum(value.(L1DistLoss(), targets, outputs),2) # assuming ObsDim.First()
2×1 Array{Float64,2}:
1.373
0.813584
julia> sum(tmp .* [2, 1]) # weigh 1st observation twice as high
3.559587
To manually compute the result for AvgMode.WeightedMean([2,1]) we follow a similar approach, but use the normalized weight vector in the last step.
# WARNING: BAD CODE - ONLY FOR ILLUSTRATION
julia> tmp = mean(value.(L1DistLoss(), targets, outputs),2) # ObsDim.First()
2×1 Array{Float64,2}:
0.34325
0.203396
julia> sum(tmp .* [0.6666, 0.3333]) # weigh 1st observation twice as high
0.29660258677499995
Note that you can specify explicitly if you want to normalize the weight vector. That option is supported for computing the weighted sum, as well as for computing the weighted mean. See the documentation for AvgMode.WeightedSum and AvgMode.WeightedMean for more information.
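The manual weighted-mean illustration above hard-codes the rounded weights 0.6666 and 0.3333. As a minimal sketch in plain Julia (again without calling the package, and with variable names chosen here for illustration), the normalized weights can instead be derived from the raw weight vector, which avoids the rounding error:

```julia
# Per-observation means copied from the illustration above (ObsDim.First()).
per_obs_mean = [0.34325, 0.203396]

# Raw weights: the first observation counts twice as much as the second.
weights = [2, 1]

# Normalize so the weights sum to one, instead of hard-coding 0.6666 / 0.3333.
normalized = weights ./ sum(weights)

weighted_mean = sum(per_obs_mean .* normalized)
println(weighted_mean)   # roughly 0.2966
```

The inputs here are the rounded numbers printed by the REPL, so the result only agrees with the package's output for AvgMode.WeightedMean([2,1]) up to that rounding.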
The code snippets above are of course very inefficient, because they allocate (multiple) temporary arrays. We only included them to demonstrate what is happening in terms of desired result / effect. For doing those computations efficiently we provide special methods for value(), deriv(), deriv2() and their mutating counterparts.
- value(loss, targets, outputs, wavgmode[, obsdim]) → Number
Computes the values of the loss function for each pair in targets and outputs individually and returns either the weighted sum or the weighted mean over all observations (depending on wavgmode). This method will not allocate a temporary array. Both arrays have to be of the same shape and size.
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
Parameters:
- loss (SupervisedLoss) – The loss-function we are interested in.
- targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
- outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
- wavgmode (AverageMode) – Must either be of type AvgMode.WeightedSum or AvgMode.WeightedMean. Either way, the specified weight vector must have one element per observation in targets and outputs.
- obsdim (ObsDimension) – Optional. Defaults to ObsDim.Last(). Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
Returns: The weighted sum or weighted mean of the loss over the observations in targets and outputs.
Return type: Number
julia> value(L1DistLoss(), [1.,2,3], [2,5,-2], AvgMode.WeightedSum([1,2,1]))
12.0
julia> value(L1DistLoss(), [1.,2,3], [2,5,-2], AvgMode.WeightedMean([1,2,1]))
3.0
julia> targets = rand(2,4)
2×4 Array{Float64,2}:
0.0743675 0.285303 0.247157 0.223666
0.513145 0.59224 0.32325 0.989964
julia> outputs = rand(2,4)
2×4 Array{Float64,2}:
0.6335 0.319131 0.637087 0.613777
0.513495 0.264587 0.533555 0.714688
julia> value(L1DistLoss(), targets, outputs, AvgMode.WeightedSum([2,1]), ObsDim.First())
3.5595869999999996
julia> value(L1DistLoss(), targets, outputs, AvgMode.WeightedMean([2,1]), ObsDim.First())
0.29663224999999993
We also provide both of these methods for deriv() and deriv2() respectively.
- deriv(loss, targets, outputs, wavgmode[, obsdim]) → Number
Same as below, but using the 1st derivative.
- deriv2(loss, targets, outputs, wavgmode[, obsdim]) → Number
Computes the (2nd) derivatives of the loss function for each pair in targets and outputs individually and returns either the weighted sum or the weighted mean over all observations (depending on wavgmode). This method will not allocate a temporary array. Both arrays have to be of the same shape and size.
Note: This function should always be type-stable. If it isn’t, you likely found a bug.
Parameters:
- loss (SupervisedLoss) – The loss-function we are interested in.
- targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
- outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
- wavgmode (AverageMode) – Must either be of type AvgMode.WeightedSum or AvgMode.WeightedMean. Either way, the specified weight vector must have one element per observation in targets and outputs.
- obsdim (ObsDimension) – Optional. Defaults to ObsDim.Last(). Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
Returns: The weighted sum or weighted mean of the (2nd) loss-derivatives over the observations in targets and outputs.
Return type: Number
julia> deriv(L2DistLoss(), [1.,2,3], [2,5,-2], AvgMode.WeightedSum([1,2,1]))
4.0
julia> deriv(L2DistLoss(), [1.,2,3], [2,5,-2], AvgMode.WeightedMean([1,2,1]))
1.0
julia> targets = rand(2,4)
2×4 Array{Float64,2}:
0.0743675 0.285303 0.247157 0.223666
0.513145 0.59224 0.32325 0.989964
julia> outputs = rand(2,4)
2×4 Array{Float64,2}:
0.6335 0.319131 0.637087 0.613777
0.513495 0.264587 0.533555 0.714688
julia> deriv(L2DistLoss(), targets, outputs, AvgMode.WeightedSum([2,1]), ObsDim.First())
4.707458000000001
julia> value(L2DistLoss(), targets, outputs, AvgMode.WeightedMean([2,1]), ObsDim.First())
0.12194772056937497
Available Loss Functions¶
Aside from the interface, this package also provides a number of popular (and not so popular) loss functions out-of-the-box. Great effort has been put into ensuring a correct, efficient, and type-stable implementation for those. Most of them belong either to the family of distance-based or margin-based losses. These two categories also indicate whether a loss is intended for regression or classification problems.
Loss Functions for Regression¶
Loss functions that belong to the category “distance-based” are primarily used in regression problems. They utilize the numeric difference between the predicted output and the true target as a proxy variable to quantify the quality of individual predictions.
Distance-based Losses¶
This section lists all the subtypes of DistanceLoss that are implemented as a part of this package.
LPDistLoss¶
L1DistLoss¶
L2DistLoss¶
LogitDistLoss¶
HuberLoss¶
L1EpsilonInsLoss¶
L2EpsilonInsLoss¶
PeriodicLoss¶
QuantileLoss¶
Note: You may notice that our definition of the QuantileLoss looks different from what one usually sees in the literature. The reason is that we have to correct for the fact that in our case \(r = \hat{y} - y\) instead of \(r_{\textrm{usual}} = y - \hat{y}\), which means that our definition relates to the usual one by \(r = -r_{\textrm{usual}}\).
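For readers comparing against the literature, the following shows what the sign change does to the textbook pinball (quantile) loss. This is a sketch only: the symbol \(\tau\) for the quantile level and the indicator notation \(\mathbb{1}[\cdot]\) are assumptions made here, and the package's own formula is not reproduced in this section.

```latex
\rho_\tau(r_{\textrm{usual}}) = r_{\textrm{usual}} \left( \tau - \mathbb{1}[r_{\textrm{usual}} < 0] \right)
\quad\Longrightarrow\quad
\rho_\tau(-r) = -r \left( \tau - \mathbb{1}[r > 0] \right) = r \left( \mathbb{1}[r > 0] - \tau \right)
```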
Loss Functions for Classification¶
Margin-based loss functions are particularly useful for binary classification. In contrast to the distance-based losses, these do not care about the difference between true target and prediction. Instead they penalize predictions based on how well they agree with the sign of the target.
Margin-based Losses¶
This section lists all the subtypes of MarginLoss that are implemented as a part of this package.
ZeroOneLoss¶
PerceptronLoss¶
L1HingeLoss¶
SmoothedL1HingeLoss¶
ModifiedHuberLoss¶
DWDMarginLoss¶
L2MarginLoss¶
L2HingeLoss¶
LogitMarginLoss¶
ExpLoss¶
SigmoidLoss¶
Internals¶
If you are interested in contributing to LossFunctions.jl, or simply want to understand how and why the package works the way it does, then take a look at our developer documentation.
Developer Documentation¶
Abstract Superclasses¶
Most of the implemented losses fall under the category of supervised losses. In other words they represent functions with two parameters (the true targets and the predicted outcomes) to compute their value.
- class SupervisedLoss¶
Abstract subtype of Loss. A loss is considered supervised if all the information needed to compute value(loss, features, targets, outputs) is contained in targets and outputs, and thus allows for the simplification value(loss, targets, outputs).
- class DistanceLoss¶
Abstract subtype of SupervisedLoss. A supervised loss that can be simplified to L(targets, outputs) = L(outputs - targets) is considered distance-based.
- class MarginLoss¶
Abstract subtype of SupervisedLoss. A supervised loss, where the targets are in {-1, 1}, and which can be simplified to L(targets, outputs) = L(targets * outputs) is considered margin-based.
Shared Interface¶
- value(loss, difference)
Computes the value of the loss function for each observation in difference individually and returns the result as an array of the same size as the parameter.
Parameters:
- loss (DistanceLoss) – An instance of the loss we are interested in.
- difference (AbstractArray) – The result of subtracting the true targets from the predicted outputs.
Returns: The value of the loss function for the elements in difference.
Return type: AbstractArray
- deriv(loss, difference)
Computes the derivative of the loss function for each observation in difference individually and returns the result as an array of the same size as the parameter.
Parameters:
- loss (DistanceLoss) – An instance of the loss we are interested in.
- difference (AbstractArray) – The result of subtracting the true targets from the predicted outputs.
Returns: The derivatives of the loss function for the elements in difference.
Return type: AbstractArray
Regression vs Classification¶
We can further divide the supervised losses into two useful sub-categories: DistanceLoss for regression and MarginLoss for classification.
Losses for Regression¶
Supervised losses that can be expressed as a univariate function of output - target are referred to as distance-based losses.
value(L2DistLoss(), difference)
Distance-based losses are typically utilized for regression problems. That said, there are also other losses that are useful for regression problems that don’t fall into this category, such as the PeriodicLoss.
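As a rough illustration of the output - target convention, the following plain-Julia sketch uses a hand-written squared distance loss. The functions `l2` and `l2_deriv` are written here for illustration and are not the package's L2DistLoss implementation. The inputs are taken from the weighted deriv example earlier in this documentation.

```julia
# Hand-written squared distance loss and its derivative; illustration only.
l2(r) = abs2(r)
l2_deriv(r) = 2r

targets = [1.0, 2.0, 3.0]
outputs = [2.0, 5.0, -2.0]

# Convention used in this package: difference = output - target.
difference = outputs .- targets      # [1.0, 3.0, -5.0]

values = l2.(difference)             # [1.0, 9.0, 25.0]
derivs = l2_deriv.(difference)       # [2.0, 6.0, -10.0]

# Weighting the derivatives by [1, 2, 1] and summing gives 4.0.
weighted = sum(derivs .* [1, 2, 1])
println(weighted)
```

The final number agrees with the 4.0 shown earlier for deriv(L2DistLoss(), [1.,2,3], [2,5,-2], AvgMode.WeightedSum([1,2,1])), which is consistent with the difference being computed as output - target.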
Note
In the literature that this package is partially based on, the convention for the distance-based losses is target - output (see [STEINWART2008] p. 38). We chose to diverge from this definition because it would force a difference between the results for the unary and the binary version of the derivative.
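One way to read this rationale, assuming the binary derivative is taken with respect to the predicted output \(\hat{y}\) and writing \(L'\) for the derivative of the univariate function (both are assumptions made here for illustration), is via the chain rule:

```latex
r = \hat{y} - y:
\quad \frac{\partial}{\partial \hat{y}} L(\hat{y} - y) = L'(r)
\\
r_{\textrm{usual}} = y - \hat{y}:
\quad \frac{\partial}{\partial \hat{y}} L(y - \hat{y}) = -L'(r_{\textrm{usual}})
```

Under the output - target convention the two versions coincide, whereas the literature convention introduces a sign flip between them.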
Losses for Classification¶
Margin-based losses are supervised losses where the values of the targets are restricted to be in \(\{1,-1\}\), and which can be expressed as a univariate function of output * target.
value(L1HingeLoss(), agreement)
Note
Throughout the codebase we refer to the result of output * target as agreement. The discussion that led to this convention can be found in issue #9.
Margin-based losses are usually used for binary classification. In contrast to other formalisms, they do not natively provide probabilities as output.
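To illustrate the notion of agreement, here is a minimal plain-Julia sketch using the textbook hinge formula. The function `hinge` is hand-written for illustration and is not the package's L1HingeLoss; the data is made up.

```julia
# Hand-written hinge function for illustration; not the package's code.
hinge(agreement) = max(zero(agreement), one(agreement) - agreement)

targets = [1.0, -1.0, 1.0, -1.0]    # labels restricted to {1, -1}
outputs = [2.5, -0.3, -1.2, 0.4]    # raw predictions, not probabilities

# The agreement is positive whenever output and target share the same sign.
agreement = targets .* outputs

losses = hinge.(agreement)
println(losses)
```

Predictions whose sign matches the target with a large enough magnitude incur no loss, while predictions with the wrong sign are penalized in proportion to how far they are on the wrong side.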
Deviations from Literature¶
Writing Tests¶
Acknowledgements¶
The basic design of this package is heavily modelled after the loss-related definitions in [STEINWART2008].
We would also like to mention that some early inspiration was drawn from EmpiricalRisks.jl.
References¶
[STEINWART2008] Steinwart, Ingo, and Andreas Christmann. “Support vector machines”. Springer Science & Business Media, 2008.
LICENSE¶
The LossFunctions.jl package is licensed under the MIT “Expat” License; see LICENSE.md in the GitHub repository.