LossFunctions.jl’s documentation

This package represents a community effort to centralize the definition and implementation of loss functions in Julia. As such, it is a part of the JuliaML ecosystem.

The sole purpose of this package is to provide an efficient and extensible implementation of various loss functions used throughout Machine Learning (ML). It is thus intended to serve as a special purpose back-end for other ML libraries that require losses to accomplish their tasks. To that end we provide a considerable number of carefully implemented loss functions, as well as an API to query their properties (e.g. convexity). Furthermore, we expose methods to compute their values, derivatives, and second derivatives for single observations as well as arbitrarily sized arrays of observations. In the case of arrays, a user can additionally specify whether and how element-wise results are averaged or summed.

From an end-user’s perspective one normally does not need to import this package directly. That said, it should provide a decent starting point for any student who is interested in investigating the properties or behaviour of loss functions.

Where to begin?

If this is the first time you are considering LossFunctions for your machine learning related experiments or packages, make sure to check out the “Getting Started” section.

Getting Started

LossFunctions.jl is the result of a collaborative effort to design and implement an efficient but also convenient-to-use Julia library for, well, loss functions. As such, this package implements the functionality needed to query various properties about a loss function (such as convexity), as well as a number of methods to compute its value, derivative, and second derivative for single observations or arrays of observations.

In this section we will provide a condensed overview of the package. In order to keep this overview concise, we will not discuss any background information or theory on the losses here in detail.

Installation

To install LossFunctions.jl, start up Julia and type the following code snippet into the REPL. It makes use of the native Julia package manager.

Pkg.add("LossFunctions")

Additionally, for example if you encounter any sudden issues, or in the case you would like to contribute to the package, you can manually choose to be on the latest (untagged) version.

Pkg.checkout("LossFunctions")

Overview

Let us take a look at a few examples to get a feeling of how one can use this library. This package is registered in the Julia package ecosystem. Once installed the package can be imported as usual.

using LossFunctions

Typically, the losses we work with in Machine Learning are functions of two variables: the true target \(y\), which represents the “ground truth” (i.e. correct answer), and the predicted output \(\hat{y}\), which is what our model thinks the truth is. All losses that can be expressed in this way will be referred to as supervised losses. The true targets are often expected to belong to a specific set (e.g. \(\{1,-1\}\) in classification), which we will refer to as \(Y\), while the predicted outputs may be any real number. So for our purposes we can define a supervised loss as follows

\[L : Y \times \mathbb{R} \rightarrow [0,\infty)\]

Such a loss function takes these two variables as input and returns a value that quantifies how “bad” our prediction is in comparison to the truth. In other words: the lower the loss, the better the prediction.

From an implementation perspective, we should point out that all the concrete loss “functions” that this package provides are actually defined as immutable types, instead of native Julia functions. We can compute the value of some type of loss using the function value(). Let us start with an example of how to compute the loss of a single observation (i.e. two numbers).

#                loss       y    ŷ
julia> value(L2DistLoss(), 1.0, 0.5)
0.25

Calling the same function using arrays instead of numbers will return the element-wise results, and thus basically just serves as a wrapper for broadcast (which, by the way, is also supported).

julia> true_targets = [  1,  0, -2];

julia> pred_outputs = [0.5,  2, -1];

julia> value(L2DistLoss(), true_targets, pred_outputs)
3-element Array{Float64,1}:
 0.25
 4.0
 1.0

Alternatively, one can also use an instance of a loss just like one would use any other Julia function. This can make the code significantly more readable while not impacting performance, as it is a zero-cost abstraction (i.e. it compiles down to the same code).

julia> loss = L2DistLoss()
LossFunctions.LPDistLoss{2}()

julia> loss(true_targets, pred_outputs) # same result as above
3-element Array{Float64,1}:
 0.25
 4.0
 1.0

julia> loss(1, 0.5f0) # single observation
0.25f0

If you are not actually interested in the element-wise results individually, but some accumulation of those (such as mean or sum), you can additionally specify an average mode. This will avoid allocating a temporary array and directly compute the result.

julia> value(L2DistLoss(), true_targets, pred_outputs, AvgMode.Sum())
5.25

julia> value(L2DistLoss(), true_targets, pred_outputs, AvgMode.Mean())
1.75

Aside from these standard unweighted average modes, we also provide weighted alternatives. These expect a weight factor for each observation in the predicted outputs, and thus allow certain observations to have a stronger influence on the result.

julia> value(L2DistLoss(), true_targets, pred_outputs, AvgMode.WeightedSum([2,1,1]))
5.5

julia> value(L2DistLoss(), true_targets, pred_outputs, AvgMode.WeightedMean([2,1,1]))
1.375
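The weighted results above can be reproduced by hand. The following worked calculation assumes that the weighted sum multiplies each element-wise loss by its weight, and that the weighted mean divides this sum by the total weight. This reading is inferred from the outputs shown rather than from a stated definition, and the weight symbol \(w_i\) is our own notation.

```latex
\[\sum_i w_i \, L(y_i, \hat{y}_i) = 2 \cdot 0.25 + 1 \cdot 4.0 + 1 \cdot 1.0 = 5.5\]

\[\frac{\sum_i w_i \, L(y_i, \hat{y}_i)}{\sum_i w_i} = \frac{5.5}{2 + 1 + 1} = 1.375\]
```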

We do not restrict the targets and outputs to be vectors, but instead allow them to be arrays of any arbitrary shape. The shape of an array may or may not have an interpretation that is relevant for computing the loss. Consequently, those methods that don’t require this information can be invoked using the same method signature as before, because the results are simply computed element-wise or accumulated.

julia> A = rand(2,3)
2×3 Array{Float64,2}:
 0.0939946  0.97639   0.568107
 0.183244   0.854832  0.962534

julia> B = rand(2,3)
2×3 Array{Float64,2}:
 0.0538206  0.77055  0.996922
 0.598317   0.72043  0.912274

julia> value(L2DistLoss(), A, B)
2×3 Array{Float64,2}:
 0.00161395  0.0423701  0.183882
 0.172286    0.0180639  0.00252607

julia> value(L2DistLoss(), A, B, AvgMode.Sum())
0.420741920634

These methods even allow arrays of different dimensionality, in which case broadcast is performed. This also applies to computing the sum and mean, in which case we use custom broadcast implementations that avoid allocating a temporary array.

julia> value(L2DistLoss(), rand(2), rand(2,2))
2×2 Array{Float64,2}:
 0.228077  0.597212
 0.789808  0.311914

julia> value(L2DistLoss(), rand(2), rand(2,2), AvgMode.Sum())
0.0860658081865589

That said, it is possible to explicitly specify which dimension denotes the observations. This is particularly useful for multivariate regression, where one may want to accumulate the loss per individual observation.

julia> value(L2DistLoss(), A, B, AvgMode.Sum(), ObsDim.First())
2-element Array{Float64,1}:
 0.227866
 0.192876

julia> value(L2DistLoss(), A, B, AvgMode.Sum(), ObsDim.Last())
3-element Array{Float64,1}:
 0.1739
 0.060434
 0.186408

julia> value(L2DistLoss(), A, B, AvgMode.WeightedSum([2,1]), ObsDim.First())
0.648608280735
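The per-observation results above can be cross-checked in plain Julia. The sketch below does not call LossFunctions.jl. It assumes Julia 1.x (for the dims keyword) and assumes that L2DistLoss computes the squared difference between output and target, which is consistent with the element-wise output printed earlier. The variable names are ours.

```julia
# Plain-Julia cross-check; no LossFunctions.jl calls are used here.
# A and B are the (rounded) matrices printed in the example above.
A = [0.0939946 0.97639  0.568107;
     0.183244  0.854832 0.962534]
B = [0.0538206 0.77055 0.996922;
     0.598317  0.72043 0.912274]

# Assumed element-wise loss: the squared distance between output and target.
elementwise = abs2.(B .- A)

# Observations along the first dimension: one sum per row.
per_row = vec(sum(elementwise, dims=2))

# Observations along the last dimension: one sum per column.
per_col = vec(sum(elementwise, dims=1))

# Weighted sum over the two row-observations with weights 2 and 1.
weighted = sum([2, 1] .* per_row)
```

Because A and B are entered with the rounded digits shown in the REPL output, the results agree with the printed values only up to rounding.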

All these function signatures of value() also apply to computing the derivatives using deriv() and the second derivatives using deriv2().

julia> true_targets = [  1,  0, -2];

julia> pred_outputs = [0.5,  2, -1];

julia> deriv(L2DistLoss(), true_targets, pred_outputs)
3-element Array{Float64,1}:
 -1.0
  4.0
  2.0

julia> deriv2(L2DistLoss(), true_targets, pred_outputs)
3-element Array{Float64,1}:
 2.0
 2.0
 2.0
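These outputs follow from differentiating the squared distance with respect to the predicted output. A short worked derivation, assuming that L2DistLoss computes \(L(y, \hat{y}) = (\hat{y} - y)^2\), which is consistent with the values shown earlier:

```latex
\[\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y), \qquad \frac{\partial^2 L}{\partial \hat{y}^2} = 2\]
```

For the first observation, \(y = 1\) and \(\hat{y} = 0.5\) give \(2(0.5 - 1) = -1\), matching the first entry of the deriv() output. The second derivative is constant, which is why deriv2() returns 2.0 for every observation.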

Additionally, we provide mutating versions for the subset of methods that return an array. These have the same function signatures with the only difference of requiring an additional parameter as the first argument. This variable should always be the preallocated array that is to be used as storage.

julia> buffer = zeros(3)
3-element Array{Float64,1}:
 0.0
 0.0
 0.0

julia> deriv!(buffer, L2DistLoss(), true_targets, pred_outputs)
3-element Array{Float64,1}:
 -1.0
  4.0
  2.0

Getting Help

To get help on specific functionality you can either look up the information here, or if you prefer you can make use of Julia’s native doc-system. The following example shows how to get additional information on L1HingeLoss within Julia’s REPL:

?L1HingeLoss
search: L1HingeLoss SmoothedL1HingeLoss

  L1HingeLoss <: MarginLoss

   The hinge loss linearly penalizes every prediction where the resulting
   agreement <= 1 . It is Lipschitz continuous and convex, but not strictly
   convex.

 L(y, ŷ) = max(0, 1 - y⋅ŷ)

             Lossfunction                     Derivative
     ┌────────────┬────────────┐      ┌────────────┬────────────┐
   3 │'\.                      │    0 │                  ┌------│
     │  ''_                    │      │                  |      │
     │     \.                  │      │                  |      │
     │       '.                │      │                  |      │
   L │         ''_             │   L' │                  |      │
     │            \.           │      │                  |      │
     │              '.         │      │                  |      │
   0 │                ''_______│   -1 │------------------┘      │
     └────────────┴────────────┘      └────────────┴────────────┘
     -2                        2      -2                        2
                y ⋅ ŷ                            y ⋅ ŷ

If you find yourself stuck or have other questions concerning the package, you can find us on Gitter or in the Machine Learning domain on discourse.julialang.org.

If you encounter a bug or would like to participate in the further development of this package, come find us on GitHub.

Introduction and Motivation

If you are new to Machine Learning in Julia, or are simply interested in how and why this package works the way it works, feel free to take a look at the following sections. There we discuss the concepts involved and outline the most important terms and definitions.

Background and Motivation

In this section we will discuss the concept “loss function” in more detail. We will start by introducing some terminology and definitions. However, please note that we won’t attempt to give a complete treatment of loss functions and the math involved, as a book or a lecture would. So this section is not a substitute for proper literature on the topic. While we will try to cover all the basics necessary to get a decent intuition of the ideas involved, we do assume basic knowledge about Machine Learning.

Warning

This section and its sub-sections serve solely to explain the underlying theory and concepts, and to motivate the solution provided by this package. As such, this section is not intended as a guide on how to apply this package.

Terminology

To start off, let us go over some basic terminology. In Machine Learning (ML) we are primarily interested in automatically learning meaningful patterns from data. For our purposes it suffices to say that in ML we try to teach the computer to solve a task by induction rather than by definition. This package is primarily concerned with the subset of Machine Learning that falls under the umbrella of Supervised Learning. There we are interested in teaching the computer to predict a specific output for some given input. In contrast to unsupervised learning the teaching process here involves showing the computer what the predicted output is supposed to be; i.e. the “true answer” if you will.

How is this relevant for this package? Well, it implies that we require some meaningful way to show the true answers to the computer so that it can learn from “seeing” them. More importantly, we have to somehow put the true answer into relation to what the computer currently predicts the answer should be. This would provide the basic information needed for the computer to be able to improve; that is what loss functions are for.

When we say we want our computer to learn something that is able to make predictions, we are talking about a prediction function, denoted as \(h\) and sometimes called “fitted hypothesis”, or “fitted model”. Note that we will avoid the term hypothesis for the simple reason that it is widely used in statistics for something completely different. We don’t consider a prediction function as the same thing as a prediction model, because we think of a prediction model as a family of prediction functions. What that boils down to is that the prediction model represents the set of possible prediction functions, while the final prediction function is the chosen function that best solves the problem. So in a way a prediction model can be thought of as the manifestation of our assumptions about the problem, because it restricts the solution to a specific family of functions. For example a linear prediction model for two features represents all possible linear functions that have two coefficients. A prediction function would in that scenario be a concrete linear function with a particular fixed set of coefficients.
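The distinction between a prediction model and a prediction function can be sketched in a few lines of plain Julia. The names linear_model and h below are hypothetical and chosen only for this illustration; they are not part of the package.

```julia
# Hypothetical names for illustration only; nothing here is part of LossFunctions.jl.

# The "prediction model": the family of linear functions of two features,
# indexed by the coefficients w1 and w2.
linear_model(w1, w2) = x -> w1 * x[1] + w2 * x[2]

# A "prediction function": one concrete member of that family,
# obtained by fixing the coefficients.
h = linear_model(0.5, -2.0)

# Applying the prediction function to the features of one observation.
output = h([4.0, 1.0])   # 0.5 * 4.0 + (-2.0) * 1.0 = 0.0
```

Fitting the model to data then amounts to choosing the coefficients, that is, picking one function out of the family.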

The purpose of a prediction function is to take some input and produce a corresponding output. That output should be as faithful as possible to the true answer. In the context of this package we will refer to the “true answer” as the true target, or “target” for short. During training, and only during training, inputs and targets can both be considered as part of our data set. We say “only during training” because in a production setting we don’t actually have the targets available to us (otherwise there would be no prediction problem to solve in the first place). In essence we can think of our data as two entities with a 1-to-1 connection in each observation, the inputs, which we call features, and the corresponding desired outputs, which we call true targets.

Let us be a little more concrete with the two terms we really care about in this package.

True Targets

A true target (singular) represents the “desired” output for the input features of the observation. The targets are often referred to as “ground truth” and we will denote a single target as \(y \in Y\). When we talk about an array (e.g. a vector) of targets, we will print it in bold as \(\mathbf{y}\). What the set \(Y\) is will depend on the subdomain of supervised learning that you are working in.

  • Real-valued Regression: \(Y \subseteq \mathbb{R}\).
  • Multioutput Regression: \(Y \subseteq \mathbb{R}^k\).
  • Margin-based Classification: \(Y = \{1,-1\}\).
  • Probabilistic Classification: \(Y = \{1,0\}\).
  • Multiclass Classification: \(Y = \{1,2,\dots,k\}\).

See MLLabelUtils for more information on classification targets.

Predicted Outputs

A predicted output (singular) is the result of our prediction function given the features of some observation. We will denote a single output as \(\hat{y} \in \mathbb{R}\) (pronounced as “why hat”). When we talk about an array of outputs, we will print it in bold as \(\mathbf{\hat{y}}\). Note something unintuitive but important: The variables \(y\) and \(\hat{y}\) don’t have to come from the same set. Even in a classification setting where \(y \in \{1,-1\}\), it is typical that \(\hat{y} \in \mathbb{R}\).

The fact that in classification the predictions can be fundamentally different from the targets is important to know. The reason for restricting the targets to specific numbers when doing classification is mathematical convenience for loss functions. So loss functions have this knowledge built in.

In a classification setting, the predicted outputs and the true targets are usually of different form and type. For example, in margin-based classification it could be the case that the target \(y=-1\) and the predicted output \(\hat{y} = -1000\). It would seem that the prediction is not really reflecting the target properly, but in this case we would actually have a perfectly correct prediction. This is because in margin-based classification the main thing that matters about the predicted output is that the sign agrees with the true target.
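The sign-agreement argument can be made concrete with a small plain-Julia sketch. The helper names agreement and predicted_class are hypothetical; the term “agreement” for the product \(y \cdot \hat{y}\) follows the wording of the L1HingeLoss docstring shown earlier. Treating an output of exactly zero as the positive class is an arbitrary choice made here for the sketch.

```julia
# Sketch in plain Julia; `agreement` and `predicted_class` are hypothetical helpers.

# In margin-based classification the product y * ŷ is called the agreement.
agreement(target, output) = target * output

# Only the sign of the output decides which class is predicted
# (an output of exactly zero is mapped to +1 here by assumption).
predicted_class(output) = output >= 0 ? 1 : -1

target, output = -1, -1000.0

agreement(target, output)          # 1000.0, positive because the signs agree
predicted_class(output) == target  # true, the prediction is correct
```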

Definitions

We base most of our definitions on the work presented in [STEINWART2008]. Note, however, that we will adapt or simplify in places at our discretion. We do this in situations where it makes sense to us considering the scope of this package or because of implementation details.

Let us again consider the term prediction function. More formally, a prediction function \(h\) is a function that maps an input from the feature space \(X\) to the real numbers \(\mathbb{R}\). So invoking \(h\) with some features \(x \in X\) will produce the prediction \(\hat{y} \in \mathbb{R}\).

\[h : X \rightarrow \mathbb{R}\]

This resulting prediction \(\hat{y}\) is what we want to compare to the target \(y\) in order to assess how bad the prediction is. The function we use for such an assessment will be of a family of functions we refer to as supervised losses. We think of a supervised loss as a function of two parameters, the true target \(y \in Y\) and the predicted output \(\hat{y} \in \mathbb{R}\). The result of computing such a loss will be a non-negative real number. The larger the value of the loss, the worse the prediction.

\[L : Y \times \mathbb{R} \rightarrow [0,\infty)\]

Note a few interesting things about supervised loss functions.

  • The absolute value of a loss is often (but not always) meaningless and doesn’t offer itself to a useful interpretation. What we usually care about is that the loss is as small as it can be.
  • In general the loss function we use is not the function we are actually interested in minimizing. Instead we are minimizing what is referred to as a “surrogate”. For binary classification for example we are really interested in minimizing the ZeroOne loss (which simply counts the number of misclassified predictions). However, that loss is difficult to minimize given that it is neither convex nor continuous. That is why we use other loss functions, such as the hinge loss or logistic loss. Those losses are “classification calibrated”, which basically means they are good enough surrogates to solve the same problem. Additionally, surrogate losses tend to have other nice properties.
  • For classification it does not need to be the case that a “correct” prediction has a loss of zero. In fact some classification calibrated losses are never truly zero.
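The points in this list can be illustrated by comparing the zero-one loss with two of its surrogates on a single, correctly classified observation. This is a plain-Julia sketch with hypothetical function names, not the package's API. The hinge formula is the one given in the L1HingeLoss docstring; the logistic loss is written in its usual margin form, \(\log(1 + e^{-y\hat{y}})\), which we assume here.

```julia
# Plain-Julia sketch; the function names are hypothetical, not the package's API.

# Zero-one loss: 1 for a misclassification, 0 otherwise.
zero_one(target, output) = sign(output) == target ? 0 : 1

# Hinge loss, as given in the L1HingeLoss docstring: max(0, 1 - y⋅ŷ).
hinge(target, output) = max(0, 1 - target * output)

# Logistic loss in its margin form (assumed here): log(1 + exp(-y⋅ŷ)).
logistic(target, output) = log1p(exp(-target * output))

# A correct and confident prediction for the class y = 1.
target, output = 1, 3.0

zero_one(target, output)  # 0
hinge(target, output)     # 0.0
logistic(target, output)  # about 0.0486: small, but strictly positive
```

The logistic loss approaches zero as the agreement grows but, mathematically, never reaches it, which is the behaviour the last bullet describes.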

Alternative Viewpoints

While the term “loss function” is usually used in the same context throughout the literature, the specifics differ from one textbook to another. For that reason we would like to mention alternative definitions of what a “loss function” is. Note that we will only give a partial and thus very simplified description of these. Please refer to the listed sources for more specifics.

In [SHALEV2014] the authors consider a loss function as a higher-order function of two parameters, a prediction model and an observation tuple. So in that definition a loss function and the prediction function are tightly coupled. This way of thinking about it makes a lot of sense, considering the process of how a prediction model is usually fit to the data. For gradient descent to do its job it needs the, well, gradient of the empirical risk. This gradient is computed using the chain rule for the inner loss and the prediction model. If one views the loss and the prediction model as one entity, then the gradient can sometimes be simplified immensely. That said, we chose not to follow this school of thought, because from a software-engineering standpoint it made more sense to us to have small modular pieces. So in our implementation the loss functions don’t need to know that prediction functions even exist. This makes the package easier to maintain, test, and reason about. Given Julia’s support for multiple dispatch we don’t even lose the ability to simplify the gradient if need be.
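To make the chain-rule remark concrete, consider a prediction function \(h_w\) with parameters \(w\). The notation \(h_w\) and \(w\) is ours, introduced only for this illustration. For a single observation \((x, y)\) the gradient factors as follows:

```latex
\[\frac{\partial}{\partial w} L\big(y, h_w(x)\big) = \left. \frac{\partial L(y, \hat{y})}{\partial \hat{y}} \right|_{\hat{y} = h_w(x)} \cdot \frac{\partial h_w(x)}{\partial w}\]
```

The first factor depends only on the loss and the pair \((y, \hat{y})\), so a loss can supply it without knowing anything about the prediction function. The second factor depends only on the prediction function.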

[SHALEV2014] Shalev-Shwartz, Shai, and Shai Ben-David. “Understanding machine learning: From theory to algorithms”. Cambridge University Press, 2014.

API Documentation

This section gives a more detailed treatment of the exposed functions and their available methods. We will start by describing how to instantiate a loss, as well as the basic interface that all loss functions share.

Working with Losses

Even though they are called loss “functions”, this package implements them as immutable types instead of true Julia functions. There are good reasons for that. For example it allows us to specify the properties of loss functions explicitly (e.g. isconvex(myloss)). It also makes for a more consistent API when it comes to computing the value or the derivative. Some loss functions even have additional parameters that need to be specified, such as the \(\epsilon\) in the case of the \(\epsilon\)-insensitive loss. Here, types allow for member variables to hide that information away from the method signatures.

In order to avoid potential confusions with true Julia functions, we will refer to “loss functions” as “losses” instead. The available losses share a common interface for the most part. This section will provide an overview of the basic functionality that is available for all the different types of losses. We will discuss how to create a loss, how to compute its value and derivative, and how to query its properties.

Instantiating a Loss

Losses are immutable types. As such, one has to instantiate one in order to work with it. For most losses, the constructors do not expect any parameters.

julia> L2DistLoss()
LossFunctions.LPDistLoss{2}()

julia> HingeLoss()
LossFunctions.L1HingeLoss()

We just said that we need to instantiate a loss in order to work with it. One could be inclined to believe that it would be more memory-efficient to “pre-allocate” a loss when using it in more than one place.

julia> loss = L2DistLoss()
LossFunctions.LPDistLoss{2}()

julia> value(loss, 2, 3)
1

However, that is a common oversimplification. Because all losses are immutable types, they can live on the stack and thus do not come with a heap-allocation overhead.

Even more interesting in the example above is that for losses such as L2DistLoss, which do not have any constructor parameters or member variables, there is no additional code executed at all. Such singletons are only used for dispatch and don’t even produce any additional code, which you can observe for yourself in the code below. As such they are zero-cost abstractions.

julia> v1(loss,t,y) = value(loss,t,y)

julia> v2(t,y) = value(L2DistLoss(),t,y)

julia> @code_llvm v1(loss, 2, 3)
define i64 @julia_v1_70944(i64, i64) #0 {
top:
  %2 = sub i64 %1, %0
  %3 = mul i64 %2, %2
  ret i64 %3
}

julia> @code_llvm v2(2, 3)
define i64 @julia_v2_70949(i64, i64) #0 {
top:
  %2 = sub i64 %1, %0
  %3 = mul i64 %2, %2
  ret i64 %3
}

On the other hand, some types of losses are actually more comparable to whole families of losses instead of just a single one. For example, the immutable type L1EpsilonInsLoss has a free parameter \(\epsilon\). Each concrete \(\epsilon\) results in a different concrete loss of the same family of epsilon-insensitive losses.

julia> L1EpsilonInsLoss(0.5)
LossFunctions.L1EpsilonInsLoss{Float64}(0.5)

julia> L1EpsilonInsLoss(1)
LossFunctions.L1EpsilonInsLoss{Float64}(1.0)

For such losses that do have parameters, it can make a slight difference to pre-instantiate a loss. While they will live on the stack, the constructor usually performs some assertions and conversions for the given parameter. This can come with a slight overhead. At the very least it will not produce the exact same code when pre-instantiated. Still, the fact that they are immutable makes them very efficient abstractions with little to no performance overhead, and zero memory allocations on the heap.
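The design described in this section can be sketched with a toy implementation. Everything below is hypothetical and written for Julia 1.x: the Toy names, the positivity check on ε, and the formulas are our own illustration of the pattern (immutable types, dispatch, a parameter stored as a member variable) and do not reproduce the package's source code.

```julia
# Illustrative sketch only: all names below are hypothetical.

abstract type ToyLoss end

# A loss without parameters is a singleton type that exists only for dispatch.
struct ToyL2Dist <: ToyLoss end

# A loss with a free parameter ε stands for a whole family of losses.
struct ToyEpsilonIns{T<:AbstractFloat} <: ToyLoss
    ε::T
    function ToyEpsilonIns{T}(ε) where {T<:AbstractFloat}
        # Assumed validity check; this is the kind of constructor work
        # that pre-instantiating a loss avoids repeating.
        ε > 0 || throw(ArgumentError("ε must be strictly positive"))
        new{T}(convert(T, ε))
    end
end

# Convenience constructor: convert any real number to floating point first.
ToyEpsilonIns(ε::Real) = ToyEpsilonIns{typeof(float(ε))}(ε)

# The value is selected by dispatching on the type of the loss.
toyvalue(::ToyL2Dist, target, output) = abs2(output - target)
toyvalue(loss::ToyEpsilonIns, target, output) =
    max(zero(loss.ε), abs(output - target) - loss.ε)

# Property queries are ordinary methods on the loss type.
toyisconvex(::ToyLoss) = true

toyvalue(ToyL2Dist(), 2, 3)             # 1
toyvalue(ToyEpsilonIns(0.5), 1.0, 2.0)  # 0.5
```

Both toy types are immutable and contain only plain bits, so instances need no heap allocation, which mirrors the argument made above for the real losses.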

Computing the Values

The first thing we may want to do is compute the loss for some observation (singular). In fact, all losses are implemented on single observations under the hood. The core function to compute the value of a loss is value(). We will see throughout the documentation that this function allows for a lot of different method signatures to accomplish a variety of tasks.

value(loss, target, output) → Number

Computes the result for the loss-function denoted by the parameter loss. Note that target and output can be of different numeric type, in which case promotion is performed in the manner appropriate for the given loss.

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

\[L : Y \times \mathbb{R} \rightarrow [0,\infty)\]
Parameters:
  • loss (SupervisedLoss) – The loss-function \(L\) we want to compute the value with.
  • target (Number) – The ground truth \(y \in Y\) of the observation.
  • output (Number) – The predicted output \(\hat{y} \in \mathbb{R}\) for the observation.
Returns:

The (non-negative) numeric result of the loss-function for the given parameters.

#               loss        y    ŷ
julia> value(L1DistLoss(), 1.0, 2.0)
1.0

julia> value(L1DistLoss(), 1, 2)
1

julia> value(L1HingeLoss(), -1, 2)
3

julia> value(L1HingeLoss(), -1f0, 2f0)
3.0f0

It may be interesting to note that this function also supports broadcasting and all the syntax benefits that come with it. Thus, it is quite simple to make use of preallocated memory for storing the element-wise results.

julia> value.(L1DistLoss(), [1,2,3], [2,5,-2])
3-element Array{Int64,1}:
 1
 3
 5

julia> buffer = zeros(3); # preallocate a buffer

julia> buffer .= value.(L1DistLoss(), [1.,2,3], [2,5,-2])
3-element Array{Float64,1}:
 1.0
 3.0
 5.0

Furthermore, with the loop fusion changes that were introduced in Julia 0.6, one can also easily weight the influence of each observation without allocating a temporary array.

julia> buffer .= value.(L1DistLoss(), [1.,2,3], [2,5,-2]) .* [2,1,0.5]
3-element Array{Float64,1}:
 2.0
 3.0
 2.5

Even though broadcasting is supported, we do expose a vectorized method natively. This is done mainly for API consistency reasons. Internally it even uses broadcast itself, but it does provide the additional benefit of a more reliable type-inference.

value(loss, targets, outputs) → Array

Computes the value of the loss function for each index-pair in targets and outputs individually and returns the result as an array of the appropriate size.

In the case that the two parameters are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

Parameters:
  • loss (SupervisedLoss) – The loss-function we want to compute the values for.
  • targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
  • outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
Returns:

The element-wise results of the loss function for all values in targets and outputs.

julia> value(L1DistLoss(), [1,2,3], [2,5,-2])
3-element Array{Int64,1}:
 1
 3
 5

julia> value(L1DistLoss(), [1.,2,3], [2,5,-2])
3-element Array{Float64,1}:
 1.0
 3.0
 5.0
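The rule for arrays with a different number of dimensions can be illustrated with Julia's own broadcasting. The sketch below does not call LossFunctions.jl; it uses the absolute distance between output and target as a stand-in for L1DistLoss, an assumption that is consistent with the outputs shown above. The variable names are ours.

```julia
# Plain-Julia sketch of the broadcasting rule; it does not call LossFunctions.jl.
# The absolute distance |ŷ - y| stands in for L1DistLoss (assumed).
targets = [1, 2]        # 2-element vector
outputs = [2 5 -2;
           0 2  4]      # 2×3 matrix; its first dimension matches `targets`

# The vector is reused for every column of the matrix.
result = abs.(outputs .- targets)
# 2×3 result: [1 4 3; 2 0 2]
```

The sizes must agree in the shared first dimension; a 3-element vector of targets would raise a DimensionMismatch error here.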

We also provide a mutating version for the same reasons. It even utilizes broadcast! underneath.

value!(buffer, loss, targets, outputs)

Computes the value of the loss function for each index-pair in targets and outputs individually, and stores them in the preallocated buffer, which has to be of the appropriate size.

In the case that the two parameters, targets and outputs, are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

Parameters:
  • buffer (AbstractArray) – Array to store the computed values in. Old values will be overwritten and lost.
  • loss (SupervisedLoss) – The loss-function we want to compute the values for.
  • targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
  • outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
Returns:

buffer (for convenience).

julia> buffer = zeros(3); # preallocate a buffer

julia> value!(buffer, L1DistLoss(), [1.,2,3], [2,5,-2])
3-element Array{Float64,1}:
 1.0
 3.0
 5.0

Computing the 1st Derivatives

Perhaps the more interesting aspect of loss functions is their derivatives. In fact, most of the popular learning algorithms in Supervised Learning, such as gradient descent, utilize the derivatives of the loss in one way or another during the training process.

To compute the derivative of some loss we expose the function deriv(). It supports the exact same method signatures as value(). Note that we always compute the derivative with respect to the predicted output, since we are interested in deducing in which direction the output should change.

deriv(loss, target, output) → Number

Computes the derivative for the loss-function denoted by the parameter loss with respect to the output. Note that target and output can be of different numeric type, in which case promotion is performed in the manner appropriate for the given loss.

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

Parameters:
  • loss (SupervisedLoss) – The loss-function \(L\) we want to compute the derivative with.
  • target (Number) – The ground truth \(y \in Y\) of the observation.
  • output (Number) – The predicted output \(\hat{y} \in \mathbb{R}\) for the observation.
Returns:

The derivative of the loss-function for the given parameters.

#               loss        y    ŷ
julia> deriv(L2DistLoss(), 1.0, 2.0)
2.0

julia> deriv(L2DistLoss(), 1, 2)
2

julia> deriv(L2HingeLoss(), -1, 2)
6

julia> deriv(L2HingeLoss(), -1f0, 2f0)
6.0f0
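The value 6 for L2HingeLoss can be checked by hand. The following worked derivation assumes that L2HingeLoss is the squared hinge loss \(L(y, \hat{y}) = \max(0, 1 - y\hat{y})^2\); that formula is not stated in this section, but it is consistent with the output above.

```latex
\[\frac{\partial L}{\partial \hat{y}} = -2y \max(0, 1 - y\hat{y})\]

\[y = -1,\ \hat{y} = 2: \quad -2 \cdot (-1) \cdot \max\big(0, 1 - (-1) \cdot 2\big) = 2 \cdot 3 = 6\]
```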

Similar to value(), this function also supports broadcasting and all the syntax benefits that come with it. Thus, one can make use of preallocated memory for storing the element-wise derivatives.

julia> deriv.(L2DistLoss(), [1,2,3], [2,5,-2])
3-element Array{Int64,1}:
   2
   6
 -10

julia> buffer = zeros(3); # preallocate a buffer

julia> buffer .= deriv.(L2DistLoss(), [1.,2,3], [2,5,-2])
3-element Array{Float64,1}:
   2.0
   6.0
 -10.0

Furthermore, with the loop fusion changes that were introduced in Julia 0.6, one can also easily weight the influence of each observation without allocating a temporary array.

julia> buffer .= deriv.(L2DistLoss(), [1.,2,3], [2,5,-2]) .* [2,1,0.5]
3-element Array{Float64,1}:
  4.0
  6.0
 -5.0

While broadcast is supported, we do expose a vectorized method natively. This is done mainly for API consistency reasons. Internally it even uses broadcast itself, but it does provide the additional benefit of a more reliable type-inference.

deriv(loss, targets, outputs) → Array

Computes the derivative of the loss function with respect to the output for each index-pair in targets and outputs individually and returns the result as an array of the appropriate size.

In the case that the two parameters are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

Parameters:
  • loss (SupervisedLoss) – The loss-function we want to compute the derivative for.
  • targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
  • outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
Returns:

The element-wise derivatives of the loss function for all elements in targets and outputs.

julia> deriv(L2DistLoss(), [1,2,3], [2,5,-2])
3-element Array{Int64,1}:
   2
   6
 -10

julia> deriv(L2DistLoss(), [1.,2,3], [2,5,-2])
3-element Array{Float64,1}:
   2.0
   6.0
 -10.0

We also provide a mutating version for the same reasons. It even utilizes broadcast! underneath.

deriv!(buffer, loss, targets, outputs)

Computes the derivatives of the loss function with respect to the outputs for each index-pair in targets and outputs individually, and stores them in the preallocated buffer, which has to be of the appropriate size.

In the case that the two parameters targets and outputs are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

Parameters:
  • buffer (AbstractArray) – Array to store the computed derivatives in. Old values will be overwritten and lost.
  • loss (SupervisedLoss) – The loss-function we want to compute the derivatives for.
  • targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
  • outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
Returns:

buffer (for convenience).

julia> buffer = zeros(3); # preallocate a buffer

julia> deriv!(buffer, L2DistLoss(), [1.,2,3], [2,5,-2])
3-element Array{Float64,1}:
   2.0
   6.0
 -10.0
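
The broadcasting behaviour for arrays with a different number of dimensions can be illustrated with `broadcast!` from Base. This is only a sketch of the idea under the assumption of a hand-written derivative `l2deriv` (hypothetical, not the package's code).

```julia
# Hypothetical stand-in for the derivative of a squared distance loss.
l2deriv(y, ŷ) = 2 * (ŷ - y)

targets = [1.0, 2.0]              # one target per row
outputs = [2.0 3.0 4.0;
           5.0 6.0 7.0]           # 2×3 matrix of predicted outputs

buffer = zeros(2, 3)              # must match the broadcasted shape

# The vector of targets is broadcast along the second dimension of outputs,
# and the results are written into the preallocated buffer.
broadcast!(l2deriv, buffer, targets, outputs)
```

Note that the two arrays agree in size along the dimension they share (the first one), as required above.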

It is also possible to compute the value and derivative at the same time. For some losses this reduces the computational overhead.

value_deriv(loss, target, output) → Tuple

Returns the results of value() and deriv() as a tuple. In some cases this function can yield better performance, because the losses can make use of shared variables when computing the results. Note that target and output can be of different numeric type, in which case promotion is performed in the manner appropriate for the given loss.

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

Parameters:
  • loss (SupervisedLoss) – The loss-function we are working with.
  • target (Number) – The ground truth \(y \in Y\) of the observation.
  • output (Number) – The predicted output \(\hat{y} \in \mathbb{R}\) for the observation.
Returns:

The value and the derivative of the loss-function for the given parameters. They are returned as a Tuple in which the first element is the value and the second element the derivative.

#                     loss         y    ŷ
julia> value_deriv(L2DistLoss(), -1.0, 3.0)
(16.0,8.0)
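
To see why a combined computation can be cheaper, consider a logistic margin loss, where the value and the derivative both depend on the same exponential. The helper below is a hypothetical sketch (its name and closed forms are assumptions for illustration, not the package's implementation).

```julia
# Hypothetical helper for L(y, ŷ) = log(1 + exp(-y * ŷ)).
function logit_value_deriv(y, ŷ)
    e = exp(-y * ŷ)          # computed once, shared by value and derivative
    val = log1p(e)           # log(1 + e)
    der = -y * e / (1 + e)   # derivative with respect to ŷ
    return val, der
end

val, der = logit_value_deriv(-1.0, 2.0)
```

Computing the two results separately would evaluate the exponential twice; here it is evaluated once.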

Computing the 2nd Derivatives

In addition to the first derivative, we also provide the corresponding methods for the second derivative through the function deriv2(). Note again that we always compute the derivative with respect to the predicted output.

deriv2(loss, target, output) → Number

Computes the second derivative for the loss-function denoted by the parameter loss with respect to the output. Note that target and output can be of different numeric type, in which case promotion is performed in the manner appropriate for the given loss.

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

Parameters:
  • loss (SupervisedLoss) – The loss-function \(L\) we want to compute the second derivative for.
  • target (Number) – The ground truth \(y \in Y\) of the observation.
  • output (Number) – The predicted output \(\hat{y} \in \mathbb{R}\) for the observation.
Returns:

The second derivative of the loss-function for the given parameters.

#               loss             y    ŷ
julia> deriv2(LogitDistLoss(), -0.5, 0.3)
0.42781939304058886

julia> deriv2(LogitMarginLoss(), -1f0, 2f0)
0.104993574f0

Just like deriv() and value(), this function also supports broadcasting and all the syntax benefits that come with it. Thus, one can make use of preallocated memory for storing the element-wise derivatives.

julia> deriv2.(LogitDistLoss(), [-0.5, 1.2, 3], [0.3, 2.3, -2])
3-element Array{Float64,1}:
 0.427819
 0.37474
 0.0132961

julia> buffer = zeros(3); # preallocate a buffer

julia> buffer .= deriv2.(LogitDistLoss(), [-0.5, 1.2, 3], [0.3, 2.3, -2])
3-element Array{Float64,1}:
 0.427819
 0.37474
 0.0132961

Furthermore deriv2() supports all the same method signatures as deriv() does. So to avoid repeating the same text over and over again, please look at the documentation of deriv() for more information.
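
As a sanity check, an analytic second derivative can be compared against a finite-difference approximation. The sketch below assumes the closed forms shown in the comments for a logistic distance loss in terms of the residual r = ŷ - y; all names are hypothetical.

```julia
# Assumed closed forms in terms of the residual r = ŷ - y.
logitdist_value(r)  = -log(4 * exp(r) / (1 + exp(r))^2)
logitdist_deriv2(r) = 2 * exp(r) / (1 + exp(r))^2

# Central finite-difference approximation of the second derivative.
fd_deriv2(f, r; h = 1e-4) = (f(r + h) - 2 * f(r) + f(r - h)) / h^2

r = 0.3 - (-0.5)                  # ŷ - y for y = -0.5 and ŷ = 0.3
analytic = logitdist_deriv2(r)
numeric  = fd_deriv2(logitdist_value, r)
```

Both results should agree up to the accuracy of the finite-difference scheme.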

Function Closures

In some circumstances it may be convenient to have the loss function or its derivative as a proper Julia function. Instead of exporting special function names for every implemented loss (like l2distloss(...)), we provide the ability to generate a true function on the fly for any given loss.

value_fun(loss) → Function

Returns a new function that computes the value() for the given loss. This new function will support all the signatures that value() does.

Parameters:
  • loss (Loss) – The loss we want the function for.

julia> f = value_fun(L2DistLoss())
(::_value) (generic function with 1 method)

julia> f(-1.0, 3.0) # computes the value of L2DistLoss
16.0

julia> f.([1.,2], [4,7])
2-element Array{Float64,1}:
  9.0
 25.0

deriv_fun(loss) → Function

Returns a new function that computes the deriv() for the given loss. This new function will support all the signatures that deriv() does.

Parameters:
  • loss (Loss) – The loss we want the derivative-function for.

julia> g = deriv_fun(L2DistLoss())
(::_deriv) (generic function with 1 method)

julia> g(-1.0, 3.0) # computes the deriv of L2DistLoss
8.0

julia> g.([1.,2], [4,7])
2-element Array{Float64,1}:
  6.0
 10.0

deriv2_fun(loss) → Function

Returns a new function that computes the deriv2() (i.e. second derivative) for the given loss. This new function will support all the signatures that deriv2() does.

Parameters:
  • loss (Loss) – The loss we want the second-derivative function for.

julia> g2 = deriv2_fun(L2DistLoss())
(::_deriv2) (generic function with 1 method)

julia> g2(-1.0, 3.0) # computes the second derivative of L2DistLoss
2.0

julia> g2.([1.,2], [4,7])
2-element Array{Float64,1}:
 2.0
 2.0

value_deriv_fun(loss) → Function

Returns a new function that computes the value_deriv() for the given loss. This new function will support all the signatures that value_deriv() does.

Parameters:
  • loss (Loss) – The loss we want the function for.

julia> fg = value_deriv_fun(L2DistLoss())
(::_value_deriv) (generic function with 1 method)

julia> fg(-1.0, 3.0) # computes the value and the derivative of L2DistLoss
(16.0,8.0)

Note, however, that these closures incur considerable overhead when called from global scope. To use them efficiently, either avoid creating them in global scope, or make sure that you pass the closure to some other function before it is used. That way the compiler will most likely inline it, making it a zero-cost abstraction.

julia> f = value_fun(L2DistLoss())
(::_value) (generic function with 1 method)

julia> @code_llvm f(-1.0, 3.0)
define %jl_value_t* @julia__value_70960(%jl_value_t*, %jl_value_t**, i32) #0 {
top:
  %3 = alloca %jl_value_t**, align 8
  store volatile %jl_value_t** %1, %jl_value_t*** %3, align 8
  %ptls_i8 = call i8* asm "movq %fs:0, $0;\0Aaddq $$-2672, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #2
    [... many more lines of code ...]
  %15 = call %jl_value_t* @jl_f__apply(%jl_value_t* null, %jl_value_t** %5, i32 3)
  %16 = load i64, i64* %11, align 8
  store i64 %16, i64* %9, align 8
  ret %jl_value_t* %15
}

julia> foo(t,y) = (f = value_fun(L2DistLoss()); f(t,y))
foo (generic function with 1 method)

julia> @code_llvm foo(-1.0, 3.0)
define double @julia_foo_71242(double, double) #0 {
top:
  %2 = fsub double %1, %0
  %3 = fmul double %2, %2
  ret double %3
}
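
The pattern of passing the closure to another function before using it can be sketched as follows. The loss type, the value function, and the closure factory below are hypothetical stand-ins written for this example; they are not the package's definitions.

```julia
# Hypothetical minimal loss type and value function.
struct MyL2Loss end
myvalue(::MyL2Loss, y, ŷ) = abs2(ŷ - y)

# Build a closure over the loss, in the spirit of value_fun.
my_value_fun(loss) = (y, ŷ) -> myvalue(loss, y, ŷ)

# Function barrier: inside `total_loss` the concrete type of `f` is known,
# so the compiler can specialize on it and will typically inline the call.
function total_loss(f, targets, outputs)
    acc = 0.0
    for (y, ŷ) in zip(targets, outputs)
        acc += f(y, ŷ)
    end
    return acc
end

f = my_value_fun(MyL2Loss())
result = total_loss(f, [1.0, 2.0], [4.0, 7.0])
```

Even though `f` lives in global scope here, the actual work happens inside `total_loss`, where the closure is a typed argument.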

Properties of a Loss

In some situations it can be quite useful to assert certain properties about a loss-function. One such scenario could be when implementing an algorithm that requires the loss to be strictly convex or Lipschitz continuous. Note that we will only skim over the definitions in most cases. A good treatment of all of the concepts involved can be found in either [BOYD2004] or [STEINWART2008].

[BOYD2004]Stephen Boyd and Lieven Vandenberghe. “Convex Optimization”. Cambridge University Press, 2004.
[STEINWART2008]Steinwart, Ingo, and Andreas Christmann. “Support vector machines”. Springer Science & Business Media, 2008.

This package uses functions to represent individual properties of a loss. What follows is a list of the implemented property-functions defined in LearnBase.jl.

isconvex(loss) → Bool

Returns true if the given loss is a convex function. A function \(f : \mathbb{R}^n \rightarrow \mathbb{R}\) is convex if its domain is a convex set and if for all \(x, y\) in that domain and all \(\theta\) with \(0 \leq \theta \leq 1\), we have

\[f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y)\]

Parameters:
  • loss (Loss) – The loss we want to check for convexity.

julia> isconvex(LPDistLoss(0.5))
false

julia> isconvex(ZeroOneLoss())
false

julia> isconvex(L1DistLoss())
true

julia> isconvex(L2DistLoss())
true
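
The defining inequality can be spot-checked numerically on sampled points. The sketch below uses a hypothetical helper `looks_convex`; it works on representing functions of the residual rather than on the package's loss types, and a positive result is only evidence, not a proof.

```julia
# Returns true if the convexity inequality holds for every sampled pair of
# points and every sampled θ.
function looks_convex(f, points, thetas; tol = 1e-12)
    for x in points, y in points, θ in thetas
        lhs = f(θ * x + (1 - θ) * y)
        rhs = θ * f(x) + (1 - θ) * f(y)
        lhs <= rhs + tol || return false
    end
    return true
end

points = -2.0:0.25:2.0
thetas = 0.0:0.1:1.0

# |r| (as for an L1 distance loss) and |r|^0.5 (as for an LP distance loss with p = 0.5)
l1_ok   = looks_convex(abs, points, thetas)
sqrt_ok = looks_convex(r -> sqrt(abs(r)), points, thetas)
```
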

isstrictlyconvex(loss) → Bool

Returns true if the given loss is a strictly convex function. A function \(f : \mathbb{R}^n \rightarrow \mathbb{R}\) is strictly convex if its domain is a convex set and if for all \(x, y\) in that domain with \(x \neq y\) and all \(\theta\) with \(0 < \theta < 1\), we have

\[f(\theta x + (1 - \theta) y) < \theta f(x) + (1 - \theta) f(y)\]

Parameters:
  • loss (Loss) – The loss we want to check for strict convexity.

julia> isstrictlyconvex(L1DistLoss())
false

julia> isstrictlyconvex(LogitDistLoss())
true

julia> isstrictlyconvex(L2DistLoss())
true

isstronglyconvex(loss) → Bool

Returns true if the given loss is a strongly convex function. A function \(f : \mathbb{R}^n \rightarrow \mathbb{R}\) is \(m\)-strongly convex if its domain is a convex set and if for all \(x, y \in\) dom \(f\) with \(x \neq y\) and all \(\theta\) with \(0 \le \theta \le 1\), we have

\[f(\theta x + (1 - \theta)y) \le \theta f(x) + (1 - \theta) f(y) - 0.5 m \cdot \theta (1 - \theta) {\| x - y \|}_2^2\]

In a more familiar setting, if the loss function is differentiable we have

\[\left( \nabla f(x) - \nabla f(y) \right)^\top (x - y) \ge m {\| x - y\|}_2^2\]

Parameters:
  • loss (Loss) – The loss we want to check for strong convexity.

julia> isstronglyconvex(L1DistLoss())
false

julia> isstronglyconvex(LogitDistLoss())
false

julia> isstronglyconvex(L2DistLoss())
true
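
For differentiable functions, the gradient inequality can be probed numerically. The sketch below compares a quadratic and a logistic representing function; the closed-form derivatives are assumptions made for this illustration, and the helper name is hypothetical.

```julia
# Derivatives of two representing functions with respect to r = ŷ - y.
dquad(r)  = 2r             # derivative of r^2
dlogit(r) = tanh(r / 2)    # derivative of -log(4 * exp(r) / (1 + exp(r))^2)

# (f'(x) - f'(y)) * (x - y) / (x - y)^2: strong convexity requires this
# quantity to stay above some fixed m > 0 for all x ≠ y.
monotonicity_ratio(df, x, y) = (df(x) - df(y)) * (x - y) / (x - y)^2

quad_ratio  = monotonicity_ratio(dquad, 20.0, 21.0)
logit_ratio = monotonicity_ratio(dlogit, 20.0, 21.0)
```

For the quadratic the ratio is constant, whereas for the logistic function it approaches zero far away from the origin, so no positive \(m\) can bound it from below.
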

isdifferentiable(loss[, at]) → Bool

Returns true if given loss is differentiable (optionally only at the given point if at is specified). A function \(f : \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}\) is differentiable at a point \(x \in\) int dom \(f\) if there exists a matrix \(Df(x)\) in \(\mathbb{R}^{m \times n}\) such that it satisfies:

\[\lim_{z \neq x, z \to x} \frac{{\|f(z) - f(x) - Df(x)(z-x)\|}_2}{{\|z - x\|}_2} = 0\]

A function is differentiable if its domain is open and it is differentiable at every point \(x\).

Parameters:
  • loss (Loss) – The loss we want to check for differentiability.
  • at (Number) – Optional. The point x at which to check whether the function is differentiable.

julia> isdifferentiable(L1DistLoss())
false

julia> isdifferentiable(L1DistLoss(), 1)
true

julia> isdifferentiable(L2DistLoss())
true

istwicedifferentiable(loss[, at]) → Bool

Returns true if the given loss is a twice differentiable function (optionally only at the given point if at is specified). A function \(f : \mathbb{R}^{n} \rightarrow \mathbb{R}\) is said to be twice differentiable at a point \(x \in\) int dom \(f\) if the derivative of \(\nabla f\) exists at \(x\):

\[\nabla^2 f(x) = D \nabla f(x)\]

A function is twice differentiable if its domain is open and it is twice differentiable at every point \(x\).

Parameters:
  • loss (Loss) – The loss we want to check for twice differentiability.
  • at (Number) – Optional. The point x at which to check whether the function is twice differentiable.

julia> istwicedifferentiable(L1DistLoss())
false

julia> istwicedifferentiable(L2DistLoss())
true

isnemitski(loss) → Bool

Returns true if given loss is a Nemitski loss function.

We call a supervised loss function \(L : Y \times \mathbb{R} \rightarrow [0,\infty)\) a Nemitski loss if there exist a measurable function \(b : Y \rightarrow [0, \infty)\) and an increasing function \(h : [0, \infty) \rightarrow [0, \infty)\) such that

\[L(y,\hat{y}) \le b(y) + h(|\hat{y}|), \qquad (y, \hat{y}) \in Y \times \mathbb{R}.\]

islipschitzcont(loss) → Bool

Returns true if the given loss function is Lipschitz continuous.

A supervised loss function \(L : Y \times \mathbb{R} \rightarrow [0, \infty)\) is Lipschitz continuous if there exists a finite constant \(M < \infty\) such that

\[|L(y, t) - L(y, t')| \le M |t - t'|, \qquad \forall (y, t, t') \in Y \times \mathbb{R} \times \mathbb{R}\]

Parameters:
  • loss (Loss) – The loss we want to check for being Lipschitz continuous.

julia> islipschitzcont(SigmoidLoss())
true

julia> islipschitzcont(ExpLoss())
false

islocallylipschitzcont(loss) → Bool

Returns true if the given loss function is locally Lipschitz continuous.

A supervised loss \(L : Y \times \mathbb{R} \rightarrow [0, \infty)\) is called locally Lipschitz continuous if \(\forall a \ge 0\) there exists a constant \(c_a \ge 0\) such that

\[\sup_{y \in Y} \left| L(y,t) - L(y,t') \right| \le c_a |t - t'|, \qquad t,t' \in [-a,a]\]

Parameters:
  • loss (Loss) – The loss we want to check for being locally Lipschitz continuous.

julia> islocallylipschitzcont(ExpLoss())
true

julia> islocallylipschitzcont(SigmoidLoss())
true

isclipable(loss) → Bool

Returns true if given loss function is clipable. A supervised loss \(L : Y \times \mathbb{R} \rightarrow [0, \infty)\) can be clipped at \(M > 0\) if, for all \((y,t) \in Y \times \mathbb{R}\),

\[L(y, \hat{t}) \le L(y, t)\]

where \(\hat{t}\) denotes the clipped value of \(t\) at \(\pm M\). That is

\[\begin{split}\hat{t} = \begin{cases} -M & \quad \text{if } t < -M \\ t & \quad \text{if } t \in [-M, M] \\ M & \quad \text{if } t > M \end{cases}\end{split}\]

Parameters:
  • loss (Loss) – The loss we want to check for being clipable.

julia> isclipable(ExpLoss())
false

julia> isclipable(L2DistLoss())
true
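
The clipping condition can be checked on sampled values. The following sketch uses hypothetical stand-ins for a squared distance loss and an exponential margin loss (assumed forms, not the package's code), with targets assumed to lie inside \([-M, M]\).

```julia
# Clip t to the interval [-M, M].
clip(t, M) = clamp(t, -M, M)

# Hypothetical stand-ins for two losses.
l2dist(y, t)  = abs2(t - y)      # distance-based squared loss
exploss(y, t) = exp(-y * t)      # margin-based exponential loss

# True if L(y, clip(t)) <= L(y, t) for all sampled y and t.
function clipping_never_hurts(L, ys, ts, M)
    all(L(y, clip(t, M)) <= L(y, t) for y in ys, t in ts)
end

ys = (-1.0, 1.0)          # targets assumed to lie inside [-M, M]
ts = -5.0:0.5:5.0
M  = 1.0

l2_clipable  = clipping_never_hurts(l2dist, ys, ts, M)
exp_clipable = clipping_never_hurts(exploss, ys, ts, M)
```

For the exponential loss, clipping a confident correct prediction increases the loss, which is why the check fails there.
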

ismarginbased(loss) → Bool

Returns true if given loss is a margin-based Loss.

A supervised loss function \(L : Y \times \mathbb{R} \rightarrow [0, \infty)\) is said to be margin-based if there exists a representing function \(\psi : \mathbb{R} \rightarrow [0, \infty)\) satisfying

\[L(y, \hat{y}) = \psi (y \cdot \hat{y}), \qquad (y, \hat{y}) \in Y \times \mathbb{R}\]

Parameters:
  • loss (Loss) – The loss we want to check for being margin-based.

julia> ismarginbased(HuberLoss(2))
false

julia> ismarginbased(L2MarginLoss())
true

isclasscalibrated(loss) → Bool

isdistancebased(loss) → Bool

Returns true if given loss is a distance-based Loss.

A supervised loss function \(L : Y \times \mathbb{R} \rightarrow [0, \infty)\) is said to be distance-based if there exists a representing function \(\psi : \mathbb{R} \rightarrow [0, \infty)\) satisfying \(\psi (0) = 0\) and

\[L(y, \hat{y}) = \psi (\hat{y} - y), \qquad (y, \hat{y}) \in Y \times \mathbb{R}\]

Parameters:
  • loss (Loss) – The loss we want to check for being distance-based.

julia> isdistancebased(HuberLoss(2))
true

julia> isdistancebased(L2MarginLoss())
false
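
The role of the representing function \(\psi\) in both definitions can be made concrete with a small sketch. The two functions below are assumed forms chosen for illustration (a squared margin loss and a squared distance loss); the names are hypothetical.

```julia
# Representing functions ψ.
ψ_margin(a)   = abs2(1 - a)   # function of the agreement a = y * ŷ
ψ_distance(r) = abs2(r)       # function of the residual r = ŷ - y, with ψ(0) = 0

# The supervised losses are obtained by plugging in the agreement or residual.
margin_loss(y, ŷ)   = ψ_margin(y * ŷ)
distance_loss(y, ŷ) = ψ_distance(ŷ - y)

m = margin_loss(-1.0, 2.0)     # agreement is -2
d = distance_loss(-1.0, 3.0)   # residual is 4
```
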

issymmetric(loss) → Bool

Returns true if the given loss is a symmetric loss.

A function \(f : \mathbb{R} \rightarrow [0,\infty)\) is said to be symmetric about the origin if we have

\[f(x) = f(-x), \qquad \forall x \in \mathbb{R}\]

A distance-based loss is said to be symmetric if its representing function is symmetric.

Parameters:
  • loss (Loss) – The loss we want to check for being symmetric.

julia> issymmetric(QuantileLoss(0.2))
false

julia> issymmetric(LPDistLoss(2))
true
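
Symmetry of a representing function can be checked on sampled residuals. In the sketch below, the pinball form is an assumption about how a quantile loss with parameter τ is commonly written and may differ from the package's definition; all names are hypothetical.

```julia
# Assumed representing functions in terms of the residual r = ŷ - y.
pinball(τ) = r -> r * ((r > 0) - τ)
squared    = r -> abs2(r)

# Sampled check of f(r) == f(-r); a `true` result is evidence, not a proof.
is_symmetric_on(f, rs) = all(isapprox(f(r), f(-r)) for r in rs)

rs = 0.0:0.5:3.0
quantile_sym = is_symmetric_on(pinball(0.2), rs)
squared_sym  = is_symmetric_on(squared, rs)
```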

Next we will consider how to average or sum the results of the loss functions more efficiently. The methods described here are implemented in such a way as to avoid allocating a temporary array.

Efficient Sum and Mean

In many situations we are not really that interested in the individual loss values (or derivatives) of each observation, but the sum or mean of them; be it weighted or unweighted. For example, by computing the unweighted mean of the loss for our training set, we would effectively compute what is known as the empirical risk. This is usually the quantity (or an important part of it) that we are interested in minimizing.

When we say “weighted” or “unweighted”, we are referring to whether we are explicitly specifying the influence of individual observations on the result. “Weighing” an observation is achieved by multiplying its value with some number (i.e. the “weight” of that observation). As a consequence, that weighted observation will have a stronger or weaker influence on the result. In order to weigh an observation we have to know which array dimension (if there is more than one) denotes the observations. On the other hand, for computing an unweighted result we don’t actually need to know anything about the meaning of the array dimensions, as long as the targets and the outputs are of compatible shape and size.

The naive way to compute such an unweighted reduction would be to call mean or sum on the result of the element-wise operation. The following code snippet shows an example of that. We say “naive” because it will not give us acceptable performance.

julia> value(L1DistLoss(), [1.,2,3], [2,5,-2])
3-element Array{Float64,1}:
 1.0
 3.0
 5.0

# WARNING: Bad code
julia> sum(value(L1DistLoss(), [1.,2,3], [2,5,-2]))
9.0

This works as expected, but there is a price for it. Before the sum can be computed, value will allocate a temporary array and fill it with the element-wise results. After that, sum will iterate over this temporary array and accumulate the values accordingly. Bottom line: we allocate temporary memory that we don’t need in the end and could avoid.

For that reason we provide special methods that compute the common accumulations efficiently without allocating temporary arrays. These methods can be invoked using an additional parameter which specifies how the values should be accumulated / averaged. The type of this parameter has to be a subtype of AverageMode.
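
The idea behind such an allocation-free accumulation can be sketched in plain Julia with a generator expression. The helper `l1value` is a hypothetical stand-in for the absolute distance loss and is not part of the package.

```julia
# Hypothetical stand-in for value(L1DistLoss(), y, ŷ).
l1value(y, ŷ) = abs(ŷ - y)

targets = [1.0, 2.0, 3.0]
outputs = [2.0, 5.0, -2.0]

# The generator yields one loss value at a time, so `sum` accumulates the
# result without materializing the element-wise array first.
total = sum(l1value(y, ŷ) for (y, ŷ) in zip(targets, outputs))
average = total / length(targets)
```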

Average Modes

Before we discuss these memory-efficient methods, let us briefly introduce the available average mode types. We provide a number of different average modes, all of which are contained within the namespace AvgMode. An instance of such a type can then be used as an additional parameter to value(), deriv(), and deriv2(), as we will see further down.

What follows is a list of the available average modes, each with a short description of its effect when used as an additional parameter to the functions mentioned above.

class AvgMode.None

Used by default. This will cause the element-wise results to be returned.

class AvgMode.Sum

Causes the method to return the unweighted sum of the elements instead of the individual elements. Can be used in combination with ObsDim, in which case a vector will be returned containing the sum for each observation (useful mainly for multivariable regression).

class AvgMode.Mean

Causes the method to return the unweighted mean of the elements instead of the individual elements. Can be used in combination with ObsDim, in which case a vector will be returned containing the mean for each observation (useful mainly for multivariable regression).

class AvgMode.WeightedSum

Causes the method to return the weighted sum of all observations. The variable weights has to be a vector of the same length as the number of observations. If normalize = true, the values of the weight vector will be normalized in such a way that they sum to one.

weights

Vector of weight values that can be used to give certain observations a stronger influence on the sum.

julia> AvgMode.WeightedSum([1,1,2]); # 3 observations

normalize

Boolean that specifies if the weight vector should be transformed in such a way that it sums to one (i.e. normalized). This will not mutate the weight vector but instead happen on the fly during the accumulation.

Defaults to false. Setting it to true only really makes sense in multivalue-regression, otherwise the result will be the same as for WeightedMean.

julia> AvgMode.WeightedSum([1,1,2], normalize = true);

class AvgMode.WeightedMean

Causes the method to return the weighted mean of all observations. The variable weights has to be a vector of the same length as the number of observations. If normalize = true, the values of the weight vector will be normalized in such a way that they sum to one.

weights

Vector of weight values that can be used to give certain observations a stronger influence on the mean.

julia> AvgMode.WeightedMean([1,1,2]); # 3 observations

normalize

Boolean that specifies if the weight vector should be transformed in such a way that it sums to one (i.e. normalized). This will not mutate the weight vector but instead happen on the fly during the accumulation.

Defaults to true. Setting it to false only really makes sense in multivalue-regression, otherwise the result will be the same as for WeightedSum.

julia> AvgMode.WeightedMean([1,1,2], normalize = false);
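
The effect of the normalize option can be illustrated for the simple vector case. The function below is a hypothetical sketch of on-the-fly normalization as described above; it is not the package's implementation and ignores the multi-dimensional case.

```julia
# Hypothetical helper for the vector case: one loss value and one weight
# per observation.
function weighted_sum(losses, weights; normalize = false)
    length(losses) == length(weights) ||
        throw(DimensionMismatch("need one weight per observation"))
    total = 0.0
    for (l, w) in zip(losses, weights)
        total += w * l
    end
    # Normalizing on the fly is the same as dividing by the sum of the weights;
    # the weight vector itself is left untouched.
    return normalize ? total / sum(weights) : total
end

losses  = [1.0, 3.0, 5.0]
weights = [1, 1, 2]

plain      = weighted_sum(losses, weights)
normalized = weighted_sum(losses, weights, normalize = true)
```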

Unweighted Sum and Mean

As hinted before, we provide special memory efficient methods for computing the sum or the mean of the element-wise (or broadcasted) results of value(), deriv(), and deriv2(). These methods avoid the allocation of a temporary array and instead compute the result directly.

value(loss, targets, outputs, avgmode) → Number

Computes the unweighted sum or mean (depending on avgmode) of the individual values of the loss function for each pair in targets and outputs. This method will not allocate a temporary array.

In the case that the two parameters are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

Parameters:
  • loss (SupervisedLoss) – The loss-function we are interested in.
  • targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
  • outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
  • avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
Returns:

The unweighted sum or mean of the individual values of the loss function for all values in targets and outputs.

Return type:

Number

julia> value(L1DistLoss(), [1,2,3], [2,5,-2], AvgMode.Sum())
9

julia> value(L1DistLoss(), [1.,2,3], [2,5,-2], AvgMode.Sum())
9.0

julia> value(L1DistLoss(), [1,2,3], [2,5,-2], AvgMode.Mean())
3.0

julia> value(L1DistLoss(), Float32[1,2,3], Float32[2,5,-2], AvgMode.Mean())
3.0f0

The exact same method signature is also implemented for deriv() and deriv2() respectively.

deriv(loss, targets, outputs, avgmode) → Number

Computes the unweighted sum or mean (depending on avgmode) of the individual derivatives of the loss function for each pair in targets and outputs. This method will not allocate a temporary array.

In the case that the two parameters are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

Parameters:
  • loss (SupervisedLoss) – The loss-function we are interested in.
  • targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
  • outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
  • avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
Returns:

The unweighted sum or mean of the individual derivatives of the loss function for all values in targets and outputs.

Return type:

Number

julia> deriv(L2DistLoss(), [1,2,3], [2,5,-2], AvgMode.Sum())
-2

julia> deriv(L2DistLoss(), [1,2,3], [2,5,-2], AvgMode.Mean())
-0.6666666666666665
deriv2(loss, targets, outputs, avgmode) → Number

Computes the unweighted sum or mean (depending on avgmode) of the individual 2nd derivatives of the loss function for each pair in targets and outputs. This method will not allocate a temporary array.

In the case that the two parameters are arrays with a different number of dimensions, broadcast will be performed. Note that the given parameters are expected to have the same size in the dimensions they share.

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

Parameters:
  • loss (SupervisedLoss) – The loss-function we are interested in.
  • targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
  • outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
  • avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
Returns:

The unweighted sum or mean of the individual 2nd derivatives of the loss function for all values in targets and outputs.

Return type:

Number

julia> deriv2(LogitDistLoss(), [1.,2,3], [2,5,-2], AvgMode.Sum())
0.49687329928636825

julia> deriv2(LogitDistLoss(), [1.,2,3], [2,5,-2], AvgMode.Mean())
0.1656244330954561

Sum and Mean per Observation

When the targets and predicted outputs are multi-dimensional arrays instead of vectors, we may be interested in accumulating the values over all but one dimension. This is typically the case when we work in a multi-variable regression setting, where each observation has multiple outputs and thus multiple targets. In those scenarios we may be more interested in the average loss for each observation, rather than the total average over all the data.

To be able to accumulate the values for each observation separately, we have to know and explicitly specify the dimension that denotes the observations. For that purpose we provide the types contained in the namespace ObsDim.

value(loss, targets, outputs, avgmode, obsdim) → Vector

Computes the values of the loss function for each pair in targets and outputs individually, and returns either the unweighted sum or mean for each observation (depending on avgmode). This method will not allocate a temporary array, but it will allocate the resulting vector.

Both arrays have to be of the same shape and size. Furthermore they have to have at least two array dimensions (i.e. they must not be vectors).

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

Parameters:
  • loss (SupervisedLoss) – The loss-function we are interested in.
  • targets (AbstractArray) – The multi-dimensional array of ground truths \(\mathbf{y}\).
  • outputs (AbstractArray) – The multi-dimensional array of predicted outputs \(\mathbf{\hat{y}}\).
  • avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
  • obsdim (ObsDimension) – Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
Returns:

A vector that contains the unweighted sums / means of the loss for each observation in targets and outputs.

Return type:

Vector

Consider the following two matrices, targets and outputs. There are two ways to interpret the shape of these arrays if one dimension is to denote the observations.

julia> targets = rand(2,4)
2×4 Array{Float64,2}:
 0.0743675  0.285303  0.247157  0.223666
 0.513145   0.59224   0.32325   0.989964

julia> outputs = rand(2,4)
2×4 Array{Float64,2}:
 0.6335    0.319131  0.637087  0.613777
 0.513495  0.264587  0.533555  0.714688

The first interpretation would be to say that the first dimension denotes the observations. Thus this data would consist of two observations with four variables each.

julia> value(L1DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.First())
2-element Array{Float64,1}:
 1.373
 0.813583

julia> value(L1DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.First())
2-element Array{Float64,1}:
 0.34325
 0.203396

The second possible interpretation would be to say that the second/last dimension denotes the observations. In that case our data consists of four observations with two variables each.

julia> value(L1DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.Last())
4-element Array{Float64,1}:
 0.559482
 0.36148
 0.600235
 0.665386

julia> value(L1DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.Last())
4-element Array{Float64,1}:
 0.279741
 0.18074
 0.300118
 0.332693

Because this method returns a vector of values, we also provide a mutating version that can make use of a preallocated vector to write the results into.

value!(buffer, loss, targets, outputs, avgmode, obsdim) → Vector

Computes the values of the loss function for each pair in targets and outputs individually, and returns either the unweighted sum or mean for each observation, depending on avgmode. The results are stored into the given vector buffer. This method will not allocate a temporary array.

Both arrays have to be of the same shape and size. Furthermore they have to have at least two array dimensions (i.e. they must not be vectors).

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

Parameters:
  • buffer (AbstractVector) – Array to store the computed values in. Old values will be overwritten and lost.
  • loss (SupervisedLoss) – The loss-function we are interested in.
  • targets (AbstractArray) – The multi-dimensional array of ground truths \(\mathbf{y}\).
  • outputs (AbstractArray) – The multi-dimensional array of predicted outputs \(\mathbf{\hat{y}}\).
  • avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
  • obsdim (ObsDimension) – Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
Returns:

buffer (for convenience).

julia> buffer = zeros(2);

julia> value!(buffer, L1DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.First())
2-element Array{Float64,1}:
 1.373
 0.813583

julia> value!(buffer, L1DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.First())
2-element Array{Float64,1}:
 0.34325
 0.203396

julia> buffer = zeros(4);

julia> value!(buffer, L1DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.Last())
4-element Array{Float64,1}:
 0.559482
 0.36148
 0.600235
 0.665386

julia> value!(buffer, L1DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.Last())
4-element Array{Float64,1}:
 0.279741
 0.18074
 0.300118
 0.332693
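
What such a per-observation accumulation into a preallocated vector does can be sketched in plain Julia. The helper `sum_per_obs!` and its integer `obsdim` argument are hypothetical and only mimic the behaviour described above for the unweighted sum; the matrices repeat the (rounded) values printed earlier.

```julia
# Hypothetical helper; `obsdim` is the index of the observation dimension.
function sum_per_obs!(buffer, f, targets, outputs, obsdim::Integer)
    size(targets) == size(outputs) ||
        throw(DimensionMismatch("targets and outputs must have the same size"))
    length(buffer) == size(targets, obsdim) ||
        throw(DimensionMismatch("buffer needs one slot per observation"))
    fill!(buffer, 0)
    # Visit every element once and add its loss to the slot of its observation.
    for I in CartesianIndices(targets)
        buffer[I[obsdim]] += f(targets[I], outputs[I])
    end
    return buffer
end

l1value(y, ŷ) = abs(ŷ - y)   # stand-in for the absolute distance loss

targets = [0.0743675 0.285303 0.247157 0.223666;
           0.513145  0.59224  0.32325  0.989964]
outputs = [0.6335    0.319131 0.637087 0.613777;
           0.513495  0.264587 0.533555 0.714688]

per_row = sum_per_obs!(zeros(2), l1value, targets, outputs, 1)  # first dimension as observations
per_col = sum_per_obs!(zeros(4), l1value, targets, outputs, 2)  # last dimension as observations
```

No temporary matrix of element-wise losses is created; only the result vector is written to.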

We also provide both of these methods for deriv() and deriv2() respectively.

deriv(loss, targets, outputs, avgmode, obsdim) → Vector

Same as below, but using the 1st derivative.

deriv2(loss, targets, outputs, avgmode, obsdim) → Vector

Computes the (2nd) derivatives of the loss function for each pair in targets and outputs individually and returns either the unweighted sum or mean for each observation (depending on avgmode). This method will not allocate a temporary array, but it will allocate the resulting vector.

Both arrays have to be of the same shape and size. Furthermore they have to have at least two array dimensions (i.e. they must not be vectors).

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

Parameters:
  • loss (SupervisedLoss) – The loss-function we are interested in.
  • targets (AbstractArray) – The multi-dimensional array of ground truths \(\mathbf{y}\).
  • outputs (AbstractArray) – The multi-dimensional array of predicted outputs \(\mathbf{\hat{y}}\).
  • avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
  • obsdim (ObsDimension) – Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
Returns:

A vector that contains the unweighted sums / means of the (2nd) loss-derivatives for each observation in targets and outputs.

Return type:

Vector

julia> targets = rand(2,4)
2×4 Array{Float64,2}:
 0.0743675  0.285303  0.247157  0.223666
 0.513145   0.59224   0.32325   0.989964

julia> outputs = rand(2,4)
2×4 Array{Float64,2}:
 0.6335    0.319131  0.637087  0.613777
 0.513495  0.264587  0.533555  0.714688

julia> deriv(L2DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.First())
2-element Array{Float64,1}:
  2.746
 -0.784548

julia> deriv(L2DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.First())
2-element Array{Float64,1}:
  0.686501
 -0.196137

julia> deriv(L2DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.Last())
4-element Array{Float64,1}:
  1.11896
 -0.58765
  1.20047
  0.22967

julia> deriv(L2DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.Last())
4-element Array{Float64,1}:
  0.559482
 -0.293825
  0.600235
  0.114835
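
To see what obsdim does qualitatively, the following sketch recomputes the results above by hand. It is not how the package computes them: it assumes Julia 1.x (for the dims keyword), allocates a temporary array, hard-codes the rounded matrices printed above, and introduces a hypothetical helper l2deriv that encodes the derivative 2 * (output - target) of the L2DistLoss.

```julia
using Statistics  # provides mean

# Rounded copies of the matrices printed above (2 rows, 4 columns).
targets = [0.0743675 0.285303 0.247157 0.223666;
           0.513145  0.59224  0.32325  0.989964]
outputs = [0.6335    0.319131 0.637087 0.613777;
           0.513495  0.264587 0.533555 0.714688]

# Hypothetical helper: derivative of the least squares loss w.r.t. the output.
l2deriv(target, output) = 2 * (output - target)

d = l2deriv.(targets, outputs)      # element-wise derivatives, 2×4 (temporary array)

# ObsDim.First(): every row is one observation, so reduce over the columns.
sum_first  = vec(sum(d, dims=2))    # ≈ [2.746, -0.784548]
mean_first = vec(mean(d, dims=2))   # ≈ [0.686501, -0.196137]

# ObsDim.Last(): every column is one observation, so reduce over the rows.
sum_last   = vec(sum(d, dims=1))    # ≈ [1.11896, -0.58765, 1.20047, 0.22967]
mean_last  = vec(mean(d, dims=1))   # ≈ [0.559482, -0.293825, 0.600235, 0.114835]
```

Because the inputs are rounded, the results agree with the printed values only up to the displayed precision.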

Because this method returns a vector of values, we also provide a mutating version that can make use of a preallocated vector to write the results into.

deriv!(buffer, loss, targets, outputs, avgmode, obsdim) → Vector

Same as below, but using the 1st derivative.

deriv2!(buffer, loss, targets, outputs, avgmode, obsdim) → Vector

Computes the (2nd) derivatives of the loss function for each pair in targets and outputs individually, and returns either the unweighted sums or means for each observation, depending on avgmode. The results are stored into the given vector buffer. This method will not allocate a temporary array.

Both arrays have to be of the same shape and size. Furthermore, they have to have at least two array dimensions (i.e. they must not be vectors).

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

Parameters:
  • buffer (AbstractVector) – Array to store the computed values in. Old values will be overwritten and lost.
  • loss (SupervisedLoss) – The loss-function we are interested in.
  • targets (AbstractArray) – The multi-dimensional array of ground truths \(\mathbf{y}\).
  • outputs (AbstractArray) – The multi-dimensional array of predicted outputs \(\mathbf{\hat{y}}\).
  • avgmode (AverageMode) – Must either be AvgMode.Sum() or AvgMode.Mean()
  • obsdim (ObsDimension) – Specifies which of the array dimensions denotes the observations. see ?ObsDim for more information.
Returns:

buffer (for convenience).

julia> buffer = zeros(2);

julia> deriv!(buffer, L2DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.First())
2-element Array{Float64,1}:
  2.746
 -0.784548

julia> deriv!(buffer, L2DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.First())
2-element Array{Float64,1}:
  0.686501
 -0.196137

julia> buffer = zeros(4);

julia> deriv!(buffer, L2DistLoss(), targets, outputs, AvgMode.Sum(), ObsDim.Last())
4-element Array{Float64,1}:
  1.11896
 -0.58765
  1.20047
  0.22967

julia> deriv!(buffer, L2DistLoss(), targets, outputs, AvgMode.Mean(), ObsDim.Last())
4-element Array{Float64,1}:
  0.559482
 -0.293825
  0.600235
  0.114835
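
The idea behind the mutating variants can be sketched in plain Julia as a loop that accumulates directly into the preallocated vector, so that no temporary array is needed. The function rowsum_l2deriv! below is a hypothetical illustration written for this document, not the package's implementation. It hard-codes the L2DistLoss derivative 2 * (output - target), the summing behaviour of AvgMode.Sum(), and the row-wise layout of ObsDim.First().

```julia
# Hypothetical helper, not part of the package: accumulate the per-observation
# sum of L2 derivatives into `buffer`, treating every row as one observation.
function rowsum_l2deriv!(buffer::AbstractVector, targets::AbstractMatrix, outputs::AbstractMatrix)
    size(targets) == size(outputs) || throw(DimensionMismatch("targets and outputs differ in size"))
    length(buffer) == size(targets, 1) || throw(DimensionMismatch("buffer has the wrong length"))
    fill!(buffer, 0)                  # old values are overwritten and lost
    @inbounds for j in axes(targets, 2), i in axes(targets, 1)
        buffer[i] += 2 * (outputs[i, j] - targets[i, j])
    end
    return buffer                     # returned for convenience
end

buffer = zeros(2)
rowsum_l2deriv!(buffer, [1.0 2.0; 3.0 4.0], [2.0 2.0; 1.0 5.0])
# buffer now holds [2.0, -2.0]
```

The same buffer can be passed to repeated calls, which is the main reason to prefer a mutating method inside a training loop.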

Weighted Sum and Mean

Up to this point, all the averaging was performed in an unweighted manner. That means that each observation was treated as equal and thus had the same potential influence on the result. In this sub-section we will consider the situations in which we do want to explicitly specify the influence of each observation (i.e. we want to weigh them). When we say we “weigh” an observation, what it effectively boils down to is multiplying the result for that observation (i.e. the computed loss or derivative) with some number. This is done for every observation individually.

To get a better understanding of what we are talking about, let us consider performing a weighting scheme manually. The following code will compute the loss for three observations, and then multiply the result of the second observation with the number 2, while the other two remain as they are. If we then sum up the results, we will see that the loss of the second observation was effectively counted twice.

julia> result = value.(L1DistLoss(), [1.,2,3], [2,5,-2]) .* [1,2,1]
3-element Array{Float64,1}:
 1.0
 6.0
 5.0

julia> sum(result)
12.0

The point of weighing observations is to inform the learning algorithm we are working with, that it is more important to us to predict some observations correctly than it is for others. So really, the concrete weight-factor matters less than the ratio between the different weights. In the example above the second observation was thus considered twice as important as any of the other two observations.

In the case of multi-dimensional arrays the process isn’t that simple anymore. In such a scenario, computing the weighted sum (or weighted mean) can be thought of as having an additional step. First we either compute the sum or (unweighted) average for each observation (which results in a vector), and then we compute the weighted sum of all observations.
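
Written as formulas, and assuming \(n\) observations with \(m\) elements each as well as a weight vector \(\mathbf{w}\) with one entry per observation, the two steps can be summarized as follows. The symbols \(n\), \(m\), and \(w_i\) are introduced here only for illustration, and the normalization shown is an assumption that matches the behaviour of the examples in this section (weights used as given for the weighted sum, normalized to sum to one for the weighted mean).

\[\textrm{WeightedSum}(\mathbf{w}) = \sum_{i=1}^{n} w_i \sum_{j=1}^{m} L(y_{ij}, \hat{y}_{ij})\]
\[\textrm{WeightedMean}(\mathbf{w}) = \sum_{i=1}^{n} \frac{w_i}{\sum_{k=1}^{n} w_k} \cdot \frac{1}{m} \sum_{j=1}^{m} L(y_{ij}, \hat{y}_{ij})\]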

The following code snippet demonstrates how to compute the AvgMode.WeightedSum([2,1]) manually. This is not meant as an example of how to do it, but simply to show what is happening qualitatively. In this example we assume that we are working in a multi-variable regression setting, in which our data set has four observations with two target-variables each.

julia> targets = rand(2,4)
2×4 Array{Float64,2}:
 0.0743675  0.285303  0.247157  0.223666
 0.513145   0.59224   0.32325   0.989964

julia> outputs = rand(2,4)
2×4 Array{Float64,2}:
 0.6335    0.319131  0.637087  0.613777
 0.513495  0.264587  0.533555  0.714688

# WARNING: BAD CODE - ONLY FOR ILLUSTRATION

julia> tmp = sum(value.(L1DistLoss(), targets, outputs),2) # assuming ObsDim.First()
2×1 Array{Float64,2}:
 1.373
 0.813584

julia> sum(tmp .* [2, 1]) # weigh 1st observation twice as high
3.559587

To manually compute the result for AvgMode.WeightedMean([2,1]) we follow a similar approach, but use the normalized weight vector in the last step.

# WARNING: BAD CODE - ONLY FOR ILLUSTRATION

julia> tmp = mean(value.(L1DistLoss(), targets, outputs),2) # ObsDim.First()
2×1 Array{Float64,2}:
 0.34325
 0.203396

julia> sum(tmp .* [0.6666, 0.3333]) # weigh 1st observation twice as high
0.29660258677499995

Note that you can specify explicitly if you want to normalize the weight vector. That option is supported for computing the weighted sum, as well as for computing the weighted mean. See the documentation for AvgMode.WeightedSum and AvgMode.WeightedMean for more information.

The code-snippets above are of course very inefficient, because they allocate (multiple) temporary arrays. We only included them to demonstrate what is happening in terms of desired result / effect. For doing those computations efficiently we provide special methods for value(), deriv(), deriv2() and their mutating counterparts.

value(loss, targets, outputs, wavgmode[, obsdim]) → Number

Computes the values of the loss function for each pair in targets and outputs individually and returns either the weighted sum or the weighted mean over all observations (depending on wavgmode). This method will not allocate a temporary array. Both arrays have to be of the same shape and size.

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

Parameters:
  • loss (SupervisedLoss) – The loss-function we are interested in.
  • targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
  • outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
  • wavgmode (AverageMode) – Must either be of type AvgMode.WeightedSum or AvgMode.WeightedMean. Either way, the specified weight vector must have the same number of observations as targets and outputs.
  • obsdim (ObsDimension) – Optional. Defaults to ObsDim.Last(). Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
Returns:

The weighted sum / mean of the losses over all observations in targets and outputs.

Return type:

Number

julia> value(L1DistLoss(), [1.,2,3], [2,5,-2], AvgMode.WeightedSum([1,2,1]))
12.0

julia> value(L1DistLoss(), [1.,2,3], [2,5,-2], AvgMode.WeightedMean([1,2,1]))
3.0

julia> targets = rand(2,4)
2×4 Array{Float64,2}:
 0.0743675  0.285303  0.247157  0.223666
 0.513145   0.59224   0.32325   0.989964

julia> outputs = rand(2,4)
2×4 Array{Float64,2}:
 0.6335    0.319131  0.637087  0.613777
 0.513495  0.264587  0.533555  0.714688

julia> value(L1DistLoss(), targets, outputs, AvgMode.WeightedSum([2,1]), ObsDim.First())
3.5595869999999996

julia> value(L1DistLoss(), targets, outputs, AvgMode.WeightedMean([2,1]), ObsDim.First())
0.29663224999999993
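
For comparison, here is a minimal sketch of how such a weighted aggregate can be computed in a single pass without temporary arrays. The function weighted_l1 and its keyword average are hypothetical names chosen for this illustration and are not part of the package. The sketch hard-codes the absolute distance loss and a row-wise observation layout, and it assumes that the weights are normalized only when the mean is requested.

```julia
# Hypothetical helper, not part of the package: weighted sum or weighted mean
# of the absolute distance loss, treating every row as one observation.
function weighted_l1(targets::AbstractMatrix, outputs::AbstractMatrix,
                     weights::AbstractVector; average::Bool = false)
    size(targets) == size(outputs) || throw(DimensionMismatch("targets and outputs differ in size"))
    length(weights) == size(targets, 1) || throw(DimensionMismatch("need one weight per observation"))
    # For the mean: normalize the weights and average within each observation.
    scale = average ? 1 / (sum(weights) * size(targets, 2)) : 1.0
    total = 0.0
    for j in axes(targets, 2), i in axes(targets, 1)
        total += weights[i] * abs(outputs[i, j] - targets[i, j])
    end
    return scale * total
end

# Three observations with one value each, written as a 3×1 matrix.
t = reshape([1.0, 2.0, 3.0], 3, 1)
o = reshape([2.0, 5.0, -2.0], 3, 1)
weighted_l1(t, o, [1, 2, 1])                  # 12.0
weighted_l1(t, o, [1, 2, 1]; average = true)  # 3.0
```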

We also provide both of these methods for deriv() and deriv2() respectively.

deriv(loss, targets, outputs, wavgmode[, obsdim]) → Number

Same as below, but using the 1st derivative.

deriv2(loss, targets, outputs, wavgmode[, obsdim]) → Number

Computes the (2nd) derivatives of the loss function for each pair in targets and outputs individually and returns either the weighted sum or the weighted mean over all observations (depending on wavgmode). This method will not allocate a temporary array. Both arrays have to be of the same shape and size.

Note: This function should always be type-stable. If it isn’t, you likely found a bug.

Parameters:
  • loss (SupervisedLoss) – The loss-function we are interested in.
  • targets (AbstractArray) – The array of ground truths \(\mathbf{y}\).
  • outputs (AbstractArray) – The array of predicted outputs \(\mathbf{\hat{y}}\).
  • wavgmode (AverageMode) – Must either be of type AvgMode.WeightedSum or AvgMode.WeightedMean. Either way, the specified weight vector must have the same number of observations as targets and outputs.
  • obsdim (ObsDimension) – Optional. Defaults to ObsDim.Last(). Specifies which of the array dimensions denotes the observations. See ?ObsDim for more information.
Returns:

The weighted sum / mean of the (2nd) loss-derivatives over all observations in targets and outputs.

Return type:

Number

julia> deriv(L2DistLoss(), [1.,2,3], [2,5,-2], AvgMode.WeightedSum([1,2,1]))
4.0

julia> deriv(L2DistLoss(), [1.,2,3], [2,5,-2], AvgMode.WeightedMean([1,2,1]))
1.0

julia> targets = rand(2,4)
2×4 Array{Float64,2}:
 0.0743675  0.285303  0.247157  0.223666
 0.513145   0.59224   0.32325   0.989964

julia> outputs = rand(2,4)
2×4 Array{Float64,2}:
 0.6335    0.319131  0.637087  0.613777
 0.513495  0.264587  0.533555  0.714688

julia> deriv(L2DistLoss(), targets, outputs, AvgMode.WeightedSum([2,1]), ObsDim.First())
4.707458000000001

julia> value(L2DistLoss(), targets, outputs, AvgMode.WeightedMean([2,1]), ObsDim.First())
0.12194772056937497

Available Loss Functions

Aside from the interface, this package also provides a number of popular (and not so popular) loss functions out-of-the-box. Great effort has been put into ensuring a correct, efficient, and type-stable implementation for those. Most of them belong either to the family of distance-based or to the family of margin-based losses. These two categories also indicate whether a loss is intended for regression or for classification problems.

Loss Functions for Regression

Loss functions that belong to the category “distance-based” are primarily used in regression problems. They utilize the numeric difference between the predicted output and the true target as a proxy variable to quantify the quality of individual predictions.

Distance-based Losses

This section lists all the subtypes of DistanceLoss that are implemented in this package.

LPDistLoss
class LPDistLoss

The \(p\)-th power absolute distance loss. It is Lipschitz continuous iff \(p = 1\), convex if and only if \(p \ge 1\), and strictly convex iff \(p > 1\).

Lossfunction Derivative
\[L(r) = | r | ^p\]
\[L'(r) = p \cdot r \cdot | r | ^{p-2}\]
L1DistLoss
class L1DistLoss

The absolute distance loss. Special case of the LPDistLoss with \(p = 1\). It is Lipschitz continuous and convex, but not strictly convex.

Lossfunction Derivative
\[L(r) = | r |\]
\[L'(r) = \textrm{sign}(r)\]
L2DistLoss
class L2DistLoss

The least squares loss. Special case of the LPDistLoss with \(p = 2\). It is strictly convex.

Lossfunction Derivative
\[L(r) = | r | ^2\]
\[L'(r) = 2 r\]
LogitDistLoss
class LogitDistLoss

The distance-based logistic loss for regression. It is strictly convex and Lipschitz continuous.

Lossfunction Derivative
\[L(r) = - \ln \frac{4 e^r}{(1 + e^r)^2}\]
\[L'(r) = \tanh \left( \frac{r}{2} \right)\]
HuberLoss
class HuberLoss
α

Loss function commonly used for robustness to outliers. For large values of \(\alpha\) most residuals fall into the quadratic branch, so it resembles (a scaled) L2DistLoss, while for small values of \(\alpha\) the linear branch dominates and it resembles (a scaled) L1DistLoss. It is Lipschitz continuous and convex, but not strictly convex.

Lossfunction Derivative
\[\begin{split}L(r) = \begin{cases} \frac{r^2}{2} & \quad \text{if } | r | \le \alpha \\ \alpha | r | - \frac{\alpha^2}{2} & \quad \text{otherwise}\\ \end{cases}\end{split}\]
\[\begin{split}L'(r) = \begin{cases} r & \quad \text{if } | r | \le \alpha \\ \alpha \cdot \textrm{sign}(r) & \quad \text{otherwise}\\ \end{cases}\end{split}\]
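
As a small illustration, the two formulas above can be transcribed directly into Julia. The names huber and huber_deriv are hypothetical helpers for this sketch and not the package's API, which works with a HuberLoss object instead.

```julia
# Hypothetical helpers that transcribe the two formulas above.
huber(r, α) = abs(r) <= α ? r^2 / 2 : α * abs(r) - α^2 / 2
huber_deriv(r, α) = abs(r) <= α ? r : α * sign(r)

huber(0.5, 1.0)        # 0.125, quadratic branch
huber(3.0, 1.0)        # 2.5, linear branch
huber_deriv(3.0, 1.0)  # 1.0
huber(1.0, 1.0)        # 0.5, both branches agree at |r| = α
```
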
L1EpsilonInsLoss
class L1EpsilonInsLoss
ϵ

The \(\epsilon\)-insensitive loss. Typically used in linear support vector regression. It ignores deviations smaller than \(\epsilon\), but penalizes larger deviations linearly. It is Lipschitz continuous and convex, but not strictly convex.

Lossfunction Derivative
\[L(r) = \max \{ 0, | r | - \epsilon \}\]
\[\begin{split}L'(r) = \begin{cases} \frac{r}{ | r | } & \quad \text{if } \epsilon \le | r | \\ 0 & \quad \text{otherwise}\\ \end{cases}\end{split}\]
L2EpsilonInsLoss
class L2EpsilonInsLoss
ϵ

The \(\epsilon\)-insensitive loss. Typically used in linear support vector regression. It ignores deviations smaller than \(\epsilon\), but penalizes larger deviations quadratically. It is convex, but not strictly convex.

Lossfunction Derivative
\[L(r) = \max \{ 0, | r | - \epsilon \}^2\]
\[\begin{split}L'(r) = \begin{cases} 2 \cdot \textrm{sign}(r) \cdot \left( | r | - \epsilon \right) & \quad \text{if } \epsilon \le | r | \\ 0 & \quad \text{otherwise}\\ \end{cases}\end{split}\]
PeriodicLoss
class PeriodicLoss
c

Measures distance on a circle of specified circumference \(c\).

Lossfunction Derivative
\[L(r) = 1 - \cos \left ( \frac{2 r \pi}{c} \right )\]
\[L'(r) = \frac{2 \pi}{c} \cdot \sin \left( \frac{2r \pi}{c} \right)\]
QuantileLoss
class QuantileLoss
τ

The quantile loss, aka pinball loss. Typically used to estimate the conditional \(\tau\)-quantiles. It is convex, but not strictly convex. Furthermore it is Lipschitz continuous.

Lossfunction Derivative
\[\begin{split}L(r) = \begin{cases} \left( 1 - \tau \right) r & \quad \text{if } r \ge 0 \\ - \tau r & \quad \text{otherwise} \\ \end{cases}\end{split}\]
\[\begin{split}L'(r) = \begin{cases} 1 - \tau & \quad \text{if } r \ge 0 \\ - \tau & \quad \text{otherwise} \\ \end{cases}\end{split}\]

Note

You may note that our definition of the QuantileLoss looks different to what one usually sees in other literature. The reason is that we have to correct for the fact that in our case \(r = \hat{y} - y\) instead of \(r_{\textrm{usual}} = y - \hat{y}\), which means that our definition relates to that in the manner of \(r = -1 * r_{\textrm{usual}}\).
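
The relationship between the two conventions can be checked with a short sketch. Both functions below are hypothetical helpers written for this note and not package API: quantile_loss transcribes the definition above, while pinball is the form commonly found in the literature, where the argument is \(y - \hat{y}\).

```julia
# Definition used in this documentation, with r = ŷ - y.
quantile_loss(r, τ) = r >= 0 ? (1 - τ) * r : -τ * r

# Common textbook form of the pinball loss, with u = y - ŷ.
pinball(u, τ) = u >= 0 ? τ * u : (τ - 1) * u

y, yhat, τ = 3.0, 5.0, 0.25
quantile_loss(yhat - y, τ)  # 1.5
pinball(y - yhat, τ)        # 1.5
```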

Loss Functions for Classification

Margin-based loss functions are particularly useful for binary classification. In contrast to the distance-based losses, these do not care about the difference between true target and prediction. Instead they penalize predictions based on how well they agree with the sign of the target.

Margin-based Losses

This section lists all the subtypes of MarginLoss that are implemented in this package.

ZeroOneLoss
class ZeroOneLoss

The classical classification loss. It penalizes every misclassified observation with a loss of \(1\) while every correctly classified observation has a loss of \(0\). It is neither convex nor continuous and thus seldom used directly. Instead one usually works with some classification-calibrated surrogate loss, such as one of those listed below.

Lossfunction Derivative
\[\begin{split}L(a) = \begin{cases} 1 & \quad \text{if } a < 0 \\ 0 & \quad \text{otherwise}\\ \end{cases}\end{split}\]
\[L'(a) = 0\]
PerceptronLoss
class PerceptronLoss

The perceptron loss linearly penalizes every prediction where the resulting agreement \(a < 0\). It is Lipschitz continuous and convex, but not strictly convex.

Lossfunction Derivative
\[L(a) = \max \{ 0, - a \}\]
\[\begin{split}L'(a) = \begin{cases} -1 & \quad \text{if } a < 0 \\ 0 & \quad \text{otherwise}\\ \end{cases}\end{split}\]
L1HingeLoss
class L1HingeLoss

The hinge loss linearly penalizes every prediction where the resulting agreement \(a < 1\). It is Lipschitz continuous and convex, but not strictly convex.

Lossfunction Derivative
\[L(a) = \max \{ 0, 1 - a \}\]
\[\begin{split}L'(a) = \begin{cases} -1 & \quad \text{if } a < 1 \\ 0 & \quad \text{otherwise}\\ \end{cases}\end{split}\]
SmoothedL1HingeLoss
class SmoothedL1HingeLoss
γ

As the name suggests, a smoothed version of the L1 hinge loss. It is Lipschitz continuous and convex, but not strictly convex.

Lossfunction Derivative
\[\begin{split}L(a) = \begin{cases} \frac{1}{2 \gamma} \cdot \max \{ 0, 1 - a \} ^2 & \quad \text{if } a \ge 1 - \gamma \\ 1 - \frac{\gamma}{2} - a & \quad \text{otherwise}\\ \end{cases}\end{split}\]
\[\begin{split}L'(a) = \begin{cases} - \frac{1}{\gamma} \cdot \max \{ 0, 1 - a \} & \quad \text{if } a \ge 1 - \gamma \\ - 1 & \quad \text{otherwise}\\ \end{cases}\end{split}\]
ModifiedHuberLoss
class ModifiedHuberLoss

A special (4 times scaled) case of the SmoothedL1HingeLoss with \(\gamma = 2\). It is Lipschitz continuous and convex, but not strictly convex.

Lossfunction Derivative
\[\begin{split}L(a) = \begin{cases} \max \{ 0, 1 - a \} ^2 & \quad \text{if } a \ge -1 \\ - 4 a & \quad \text{otherwise}\\ \end{cases}\end{split}\]
\[\begin{split}L'(a) = \begin{cases} - 2 \cdot \max \{ 0, 1 - a \} & \quad \text{if } a \ge -1 \\ - 4 & \quad \text{otherwise}\\ \end{cases}\end{split}\]
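
The stated relationship to the SmoothedL1HingeLoss can be checked numerically. The sketch below transcribes the formulas above into two hypothetical helper functions (smoothed_hinge and modified_huber are names chosen for this illustration, not package API) and compares them at a few values of the agreement.

```julia
# Hypothetical transcriptions of the two loss formulas.
smoothed_hinge(a, γ) = a >= 1 - γ ? max(0, 1 - a)^2 / (2γ) : 1 - γ / 2 - a
modified_huber(a) = a >= -1 ? max(0, 1 - a)^2 : -4a

# The modified Huber loss equals four times the smoothed hinge loss with γ = 2.
for a in (-3.0, -1.0, 0.0, 0.5, 2.0)
    @assert modified_huber(a) ≈ 4 * smoothed_hinge(a, 2)
end
```
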
DWDMarginLoss
class DWDMarginLoss
q

The distance weighted discrimination margin loss. A differentiable generalization of the L1 hinge loss that differs from the SmoothedL1HingeLoss.

Lossfunction Derivative
\[\begin{split}L(a) = \begin{cases} 1 - a & \quad \text{if } a \le \frac{q}{q+1} \\ \frac{1}{a^q} \frac{q^q}{(q+1)^{q+1}} & \quad \text{otherwise}\\ \end{cases}\end{split}\]
\[\begin{split}L'(a) = \begin{cases} - 1 & \quad \text{if } a \le \frac{q}{q+1} \\ - \frac{1}{a^{q+1}} \left( \frac{q}{q+1} \right)^{q+1} & \quad \text{otherwise}\\ \end{cases}\end{split}\]
L2MarginLoss
class L2MarginLoss

The margin-based least-squares loss for classification, which quadratically penalizes every prediction where \(a \ne 1\). It is locally Lipschitz continuous and strongly convex.

Lossfunction Derivative
\[L(a) = {\left( 1 - a \right)}^2\]
\[L'(a) = 2 \left( a - 1 \right)\]
L2HingeLoss
class L2HingeLoss

The truncated version of the least-squares loss. It quadratically penalizes every prediction where the resulting agreement \(a < 1\). It is locally Lipschitz continuous and convex, but not strictly convex.

Lossfunction Derivative
\[L(a) = \max \{ 0, 1 - a \} ^2\]
\[\begin{split}L'(a) = \begin{cases} 2 \left( a - 1 \right) & \quad \text{if } a < 1 \\ 0 & \quad \text{otherwise}\\ \end{cases}\end{split}\]
LogitMarginLoss
class LogitMarginLoss

The margin version of the logistic loss. It is infinitely many times differentiable, strictly convex, and Lipschitz continuous.

Lossfunction Derivative
\[L(a) = \ln (1 + e^{-a})\]
\[L'(a) = - \frac{1}{1 + e^a}\]
ExpLoss
class ExpLoss

The margin-based exponential loss used for classification, which penalizes every prediction exponentially. It is infinitely many times differentiable, locally Lipschitz continuous and strictly convex, but not clipable.

Lossfunction Derivative
\[L(a) = e^{-a}\]
\[L'(a) = - e^{-a}\]
SigmoidLoss
class SigmoidLoss

The so-called sigmoid loss is a continuous margin-based loss which penalizes every prediction with a loss within the range (0,2). It is infinitely many times differentiable and Lipschitz continuous, but nonconvex.

Lossfunction Derivative
\[L(a) = 1 - \tanh(a)\]
\[L'(a) = - \textrm{sech}^2 (a)\]

Internals

If you are interested in contributing to LossFunctions.jl, or simply want to understand how and why the package does what it does, then take a look at our developer documentation.

Developer Documentation

Abstract Superclasses

Most of the implemented losses fall under the category of supervised losses. In other words they represent functions with two parameters (the true targets and the predicted outcomes) to compute their value.

class SupervisedLoss

Abstract subtype of Loss. A loss is considered supervised if all the information needed to compute value(loss, features, targets, outputs) is contained in targets and outputs, which allows for the simplification value(loss, targets, outputs).

class DistanceLoss

Abstract subtype of SupervisedLoss. A supervised loss that can be simplified to L(targets, outputs) = L(outputs - targets) is considered distance-based.

class MarginLoss

Abstract subtype of SupervisedLoss. A supervised loss, where the targets are in {-1, 1}, and which can be simplified to L(targets, outputs) = L(targets * outputs) is considered margin-based.

Shared Interface

value(loss, agreement)

Computes the value of the loss function for each observation in agreement individually and returns the result as an array of the same size as the parameter.

Parameters:
  • loss (MarginLoss) – An instance of the loss we are interested in.
  • agreement (AbstractArray) – The result of multiplying the true targets with the predicted outputs.
Returns:

The value of the loss function for the elements in agreement.

Return type:

AbstractArray

deriv(loss, agreement)

Computes the derivative of the loss function for each observation in agreement individually and returns the result as an array of the same size as the parameter.

Parameters:
  • loss (MarginLoss) – An instance of the loss we are interested in.
  • agreement (AbstractArray) – The result of multiplying the true targets with the predicted outputs.
Returns:

The derivatives of the loss function for the elements in agreement.

Return type:

AbstractArray

value_deriv(loss, agreement)

Returns the results of value() and deriv() as a tuple. In some cases this function can yield better performance, because the losses can make use of shared variables when computing the values.
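
To illustrate where such savings can come from, consider the exponential loss, whose value \(e^{-a}\) and derivative \(-e^{-a}\) share the same exponential term. The function exp_value_deriv below is a hypothetical sketch written for this explanation and not the package's implementation.

```julia
# Hypothetical illustration, not the package's implementation: the value and
# the derivative of the exponential loss share the term exp(-a).
function exp_value_deriv(a::Real)
    e = exp(-a)        # computed once, used twice
    return (e, -e)     # (value, derivative)
end

exp_value_deriv(0.0)   # (1.0, -1.0)
```

Calling separate value and derivative functions would evaluate the exponential twice, while the combined version evaluates it once.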

Shared Interface

value(loss, difference)

Computes the value of the loss function for each observation in difference individually and returns the result as an array of the same size as the parameter.

Parameters:
  • loss (DistanceLoss) – An instance of the loss we are interested in.
  • difference (AbstractArray) – The result of subtracting the true targets from the predicted outputs.
Returns:

The value of the loss function for the elements in difference.

Return type:

AbstractArray

deriv(loss, difference)

Computes the derivative of the loss function for each observation in difference individually and returns the result as an array of the same size as the parameter.

Parameters:
  • loss (DistanceLoss) – An instance of the loss we are interested in.
  • difference (AbstractArray) – The result of subtracting the true targets from the predicted outputs.
Returns:

The derivatives of the loss function for the elements in difference.

Return type:

AbstractArray

value_deriv(loss, difference)

Returns the results of value() and deriv() as a tuple. In some cases this function can yield better performance, because the losses can make use of shared variables when computing the values.

Regression vs Classification

We can further divide the supervised losses into two useful sub-categories: DistanceLoss for regression and MarginLoss for classification.

Losses for Regression

Supervised losses that can be expressed as a univariate function of output - target are referred to as distance-based losses.

value(L2DistLoss(), difference)

Distance-based losses are typically utilized for regression problems. That said, there are also other losses that are useful for regression problems that don’t fall into this category, such as the PeriodicLoss.

Note

In the literature that this package is partially based on, the convention for the distance-based losses is target - output (see [STEINWART2008] p. 38). We chose to diverge from this definition because it would force a difference between the results for the unary and the binary version of the derivative.

Losses for Classification

Margin-based losses are supervised losses where the values of the targets are restricted to be in \(\{1,-1\}\), and which can be expressed as a univariate function of output * target.

value(L1HingeLoss(), agreement)

Note

Throughout the codebase we refer to the result of output * target as agreement. The discussion that led to this convention can be found in issue #9.

Margin-based losses are usually used for binary classification. In contrast to other formalisms, they do not natively provide probabilities as output.
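
The notion of agreement can be made concrete with a few lines of plain Julia. The numbers below are made up for illustration, and the hinge loss is written out by hand as \(\max \{ 0, 1 - a \}\) instead of calling the package.

```julia
targets = [1, -1, 1, -1]            # labels in {-1, 1}
outputs = [0.8, -2.5, -0.3, 0.1]    # raw predictions, made up for illustration

# Positive agreement means that the sign of the prediction matches the label.
agreement = targets .* outputs      # [0.8, 2.5, -0.3, -0.1]

# Hinge loss written out by hand: zero only where the agreement is at least 1.
hinge = max.(0, 1 .- agreement)     # ≈ [0.2, 0.0, 1.3, 1.1]
```

Note that the first prediction has the correct sign but is still penalized, because its agreement is smaller than 1.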

Deviations from Literature

Writing Tests

Acknowledgements

The basic design of this package is heavily modelled after the loss-related definitions in [STEINWART2008].

We would also like to mention that some early inspiration was drawn from EmpiricalRisks.jl.

References

[STEINWART2008] Steinwart, Ingo, and Andreas Christmann. “Support vector machines”. Springer Science & Business Media, 2008.

LICENSE

The LossFunctions.jl package is licensed under the MIT “Expat” License; see LICENSE.md in the GitHub repository.