The Dual Representation of ReLU Networks

In this post, we introduce the dual representation of fully connected feedforward ReLU networks. This is part one of a three-part series on the geometry of generalization of deep neural networks.

The Dual Representation

This is part one of a three-part series on the geometry of generalization, which is the essence of my master's thesis. In this series, we will present a novel perspective on the generalization of overparameterized networks.

Introduction

The constructions put forward in this post are inspired by Piwek et al. In particular, we will establish a dual representation of fully connected feedforward ReLU networks, a symbolic representation that allows thinking about ReLU networks and their complexity in an abstract setting. This will be useful in parts two and three of this series.

The dual representation has been established in a number of previous works and utilizes tropical geometry. While this is an interesting theory, one can use the close relationship between tropical geometry and affine geometry to make the constructions easier to interpret.

However, throughout this series, I will refrain from proving every statement in detail. In particular, we can largely set tropical geometry and its relationship to affine geometry aside. The curious reader may refer to Chapter 3 and Chapter 4 of my thesis, in particular Section 4.2, for more details on this relationship.

Finally, for every claim made in this series, I will refer to the corresponding statement in the thesis. There, one can find a proof as well as a reference to the corresponding statement by Piwek et al. wherever appropriate.

ReLU Networks

To introduce notation and get everybody on board, we start by quickly defining fully connected feedforward ReLU networks.

(Fully Connected Feedforward Networks) A fully connected feedforward network $\mathcal N \colon \mathbb R^{d} \to \mathbb R^{n_L}$ takes as an input a vector $\mathbf x \in \mathbb R^{d}$ and returns an output $\mathbf y := \mathbf a_L$. It is defined inductively by $$ \left\{ \begin{aligned} \mathbf a_0 &:= \mathbf x \\ \mathbf a_{l+1} &= \rho_{t_{l+1}}\left({\mathbf W}_{l+1} \mathbf a_l + \mathbf b_{l+1}\right), \; l = 0,\ldots,L-1, \end{aligned} \right. $$ where $\mathbf W_{l+1} \in \mathbb R^{n_{l+1}, n_{l}}$ and $\mathbf b_{l+1} \in \mathbb R^{n_{l+1}}$ are the weight matrix and bias vector at layer $l+1$. Furthermore, $\rho_{t_{l+1}}(x) = \max(x, t_{l+1})$ is the (entry-wise) activation function at layer $l+1$ with threshold $t_{l+1} \in \mathbb R \cup \{-\infty\}$. The number $L$ is called the depth of the network, while $n_l$ is the width of layer $l$. The network is deep if $L \gg 1$.
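For concreteness, here is a minimal NumPy sketch of this forward pass; the helper names (`rho`, `forward`) and the toy weights at the bottom are my own choices, not notation from the thesis.

```python
import numpy as np

def rho(x, t):
    """Thresholded activation rho_t(x) = max(x, t); t = -inf gives the identity."""
    return np.maximum(x, t)

def forward(x, weights, biases, thresholds):
    """Evaluate the network output a_L for an input x.

    weights[l], biases[l] and thresholds[l] play the roles of W_{l+1}, b_{l+1} and t_{l+1}.
    """
    a = np.asarray(x, dtype=float)
    for W, b, t in zip(weights, biases, thresholds):
        a = rho(W @ a + b, t)
    return a

# A tiny two-layer example: one ReLU layer, one linear (t = -inf) output layer.
weights = [np.array([[-1., -1.], [1., -2.]]), np.array([[3., -1.]])]
biases = [np.array([1., -1.]), np.array([2.])]
thresholds = [0.0, -np.inf]
print(forward([0.5, -0.5], weights, biases, thresholds))   # [4.5]
```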

The two activation functions we consider are:
  1. the ReLU activation $\rho_0(x) = \max(x, 0)$, i.e., the threshold $t = 0$, and
  2. the linear (identity) activation $\rho_{-\infty}(x) = \max(x, -\infty) = x$, i.e., the threshold $t = -\infty$.

This leads to the following definition:

(ReLU Networks) A network in the sense of Definition with ReLU activations (and potentially a linear activation at the last layer) is called a ReLU network.

Affine Geometry

In this section, we introduce fundamental concepts of affine geometry, covering basic definitions, the dual representation of affine functions, and their connection to upper convex hulls. Throughout this section, fix an integer $d \in \mathbb N$.

Affine and CPA Functions

We begin with some fundamental concepts.

(Affine Functions) Given a vector $\mathbf{a} \in \mathbb R^d$ and a scalar $b \in \mathbb R$, we define the affine function with parameters $\mathbf a$ and $b$ as $$ \begin{align*} f_{\mathbf a, b} \colon \mathbb R^d &\to \mathbb R \\ \mathbf x &\mapsto \langle \mathbf a, \mathbf x \rangle + b, \end{align*} $$ where $\langle \cdot, \cdot \rangle$ is the Euclidean inner product on $\mathbb R^d$.

Ultimately, we will be taking maxima over affine maps. To classify such maxima, we introduce CPA functions:

(CPA Functions) We say that a function $f \colon \mathbb R^d \to \mathbb R$ is CPA if it is convex and piecewise affine. We denote by $\text{CPA}(d)$ the set of CPA functions $\mathbb R^d \to \mathbb R$.

It turns out that the class of CPA functions coincides with the class of maxima over affine functions:

(Characterizing CPA Functions, Proposition 2 in Piwek et al.) Every function $F \colon \mathbb R^d \to \mathbb R$ of the form $$ \begin{equation*} F(\mathbf x) = \max\{f_1(\mathbf x),\ldots,f_n(\mathbf x)\} \end{equation*} $$ with affine functions $f_i \colon \mathbb R^d \to \mathbb R$ is CPA. Conversely, every CPA function with a finite number of affine pieces is of this form.
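As a quick illustration (my own example, not taken from the thesis or Piwek et al.), both the ReLU and the absolute value arise as maxima of two affine functions and are therefore CPA:

\[\begin{equation*} \max\{0, x\} = \max\{f_{0,0}(x),\, f_{1,0}(x)\}, \qquad |x| = \max\{x, -x\} = \max\{f_{1,0}(x),\, f_{-1,0}(x)\}. \end{equation*}\]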

Later in this series, we will also consider differences of CPA functions:

(DCPA Functions) We say that a function $f \colon \mathbb R^d \to \mathbb R$ is DCPA if it can be written as the difference of two CPA functions. We denote by $\text{DCPA}(d)$ the set of DCPA functions $\mathbb R^d \to \mathbb R$.

Affine Dualities

In this section, we mainly follow the argument presented by Piwek et al., which allows mapping an affine function $f_{\mathbf a, b} \colon \mathbb R^d \to \mathbb R$ to a dual space. As an outlook, this transformation will ultimately allow us to understand ReLU networks as DCPA functions.

The graph of an affine function $\mathbb R^d \to \mathbb R$ defines a hyperplane in real space, which we define as $\mathcal R := \mathbb R^d \times \mathbb R = \mathbb R^{d+1}$. The space of affine functions whose graph lies in $\mathcal R$ is called real affine space, denoted by $\text{Aff}_{\mathfrak R}(d)$.

As mentioned in Definition, any affine function $f_{\mathbf{a},b} \in \text{Aff}_{\mathfrak{R}}(d)$ is characterized by its parameters $(\mathbf a, b) \in \mathbb R^{d+1}$:

We refer to the copy of $\mathbb R^{d+1}$ that parametrizes affine functions in $\text{Aff}_{\mathfrak{R}}(d)$ as dual space, denoted by $\mathcal D$.

The following lemma is a natural consequence of this construction, as it allows translating between real affine space and dual space:

(Lemma 3.2.1 in Nazari) For any fixed dimension $d$, there exists a bijection between dual space and real affine space, given by $$ \begin{align*} \mathcal R \colon \mathcal D &\xrightarrow{\sim} \text{Aff}_{\mathfrak R}(d) \\ (\mathbf x, y) &\mapsto f_{\mathbf x, y}. \end{align*} $$
Example of the dual representation of an affine map $f_{\mathbf a, b}$ with $\mathbf a = (-1/2, -3/4)$, $b=3/4$. Subfigure (b) contains the graph of $f_{\mathbf a,b}$ and Subfigure (a) contains the parameterizing dual point $(\mathbf a, b) \in \mathcal D$. The map $\mathcal R$ assigns to the point $(\mathbf a,b)$ the affine map $f_{\mathbf a, b}$.

An example of $\mathcal R$ can be found in Figure. The map $\mathcal R$ has the following properties:

(Proposition 3.2.2 in Nazari) Let $\{\mathbf x_i, y_i\}_{i=1,\ldots,n} \subseteq \mathcal D$ be a set of dual points. Then the following are true:
  1. $\mathcal R$ is a linear operator, i.e., for any set of scalars $\{\alpha_i\}_{i=1,\ldots,n} \subseteq \mathbb R$, $$ \begin{equation*} \mathcal R\left(\sum_{i=1}^n \alpha_i (\mathbf x_i, y_i)\right) = \sum_{i=1}^n \alpha_i \mathcal R((\mathbf x_i,y_i)). \end{equation*} $$
  2. The set of dual points is linearly independent if and only if the corresponding set $\{\mathcal R((\mathbf x_i,y_i))\}_{i=1,\ldots,n}$ of affine functions is linearly independent.
  3. The set of dual points is affinely independent if and only if the corresponding set $\{\mathcal R((\mathbf x_i,y_i))\}_{i=1,\ldots,n}$ of affine functions is affinely independent.

Since both $\mathcal R$ and $\mathcal D$ are copies of $\mathbb R^{d+1}$, it is natural to ask whether we can reverse their roles in the above construction. The answer to this question is yes. We define dual affine space $\text{Aff}_{\mathfrak D}(d)$ as the space of affine functions with graph in $\mathcal D$. Analogously to the above construction, these affine functions are parameterized by points in $\mathcal R$, though with a slight caveat:

For any fixed dimension $d$, there exists a bijection between dual affine space and real space. It is given by $$ \begin{align*} \check{\mathcal R} \colon \text{Aff}_{\mathfrak D}(d) &\xrightarrow{\sim} \mathcal R \\ f_{\mathbf a, b} &\mapsto (-\mathbf a, b). \end{align*} $$
Figure provides an overview of the relationship between $\mathcal R, \mathcal D, \text{Aff}_{\mathfrak R}(d)$ and $\text{Aff}_{\mathfrak D}(d)$.
Diagram indicating the relationship between real (affine) and dual (affine) space.

Note that, compared to $\mathcal R$, the function $\check{\mathcal R}$ includes an additional minus sign and maps in the opposite direction. This is essential for ensuring that the duality properties in the following proposition hold:

(Duality Properties, Proposition 7 in Piwek et al.) The maps $\mathcal R$ and $\check{\mathcal R}$ have the following properties:
  1. A dual point $\mathbf c \in \mathcal D$ lies on the graph of a dual affine function $f_{\mathbf a,b} \in \text{Aff}_{\mathfrak D}(d)$ if and only if the graph of the corresponding real affine function $\mathcal R(\mathbf c)$ contains the corresponding real point $\check{\mathcal R}(f_{\mathbf a,b})$.
  2. A dual point $\mathbf c \in \mathcal D$ lies above the graph of a dual affine function $f_{\mathbf a,b} \in \text{Aff}_{\mathfrak D}(d)$ if and only if the real point $\check{\mathcal R}(f_{\mathbf a,b})$ lies below the graph of $\mathcal R(\mathbf c)$.

CPA Functions as Upper Convex Hulls

In the previous section, we explored a duality that allows identifying affine maps with the vector containing their parameters. In this section, we apply these results to maxima over affine functions, which, by Proposition, can be understood as CPA functions.

In light of the duality results from the previous section, CPA functions correspond to finite sets of dual points:

On the set $\mathcal P_\text{fin}(\mathcal D)$ of finite subsets of $\mathcal D$, the operator $$ \begin{align*} \mathcal Q \colon \mathcal P_\text{fin}(\mathcal D) &\to \text{CPA}(d)\\ S &\mapsto \mathcal Q(S) := \max_{\mathbf s \in S} \mathcal R(\mathbf s) \end{align*} $$ assigns to a set of dual points the associated CPA function $$ \begin{align*} \max_{\mathbf s \in S}\mathcal R(\mathbf s)(\mathbf x) = \max_{(\mathbf a, b) \in S} \langle \mathbf x, \mathbf a \rangle + b. \end{align*} $$ We define $\mathcal Q(\emptyset) := 0$. On a vector of finite sets of dual points, $\mathcal Q$ acts component-wise.
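As a minimal sketch of how $\mathcal Q$ can be evaluated numerically (the representation of dual points as flat tuples and the helper name `Q` are my own choices):

```python
import numpy as np

def Q(S):
    """CPA function of a finite set S of dual points: x -> max_{(a, b) in S} <a, x> + b.

    Each dual point is a tuple of length d + 1; the first d entries are a,
    the last entry is b.  By convention, Q(emptyset) is the zero function.
    """
    if not S:
        return lambda x: 0.0
    A = np.array([s[:-1] for s in S], dtype=float)   # one row per slope a
    b = np.array([s[-1] for s in S], dtype=float)    # the offsets b
    return lambda x: float(np.max(A @ np.asarray(x, dtype=float) + b))

# Q({(1, 0), (-1, 0)}) is the absolute value on R (d = 1).
f = Q({(1., 0.), (-1., 0.)})
print(f([2.0]), f([-3.0]))   # 2.0 3.0
```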

Note that, by Proposition, the operator $\mathcal Q$ does indeed map to $\text{CPA}(d)$.

Example of an upper convex hull. Let $S$ be the union of all displayed points. The blue points correspond to $\mathcal U^*(S)$. In particular, $\mathcal Q(S)$ is uniquely identified by those points.

Our next objective is to establish a connection between CPA functions and upper convex hulls. Here, the upper convex hull $\mathcal U(S)$ of a finite set $S \subseteq \mathcal D$ is the upper boundary of the convex hull of $S$ with respect to the last coordinate, i.e., the union of those faces of the convex hull whose outward normal has a positive last component, and $\mathcal U^*(S) \subseteq S$ is the set of its vertices. To begin, we state the following proposition:

(Maximality of Upper Convex Hull, Proposition 3.3.2 in Nazari) Let $S \subseteq \mathcal D$ be a finite set of points. Then for every point $w \in \mathcal D$ lying below or on $\mathcal U(S)$, the affine function dual to $w$ lies (weakly) below the maximum of the affine functions whose duals lie in $\mathcal U^*(S)$. That is, $$ \begin{equation} \mathcal R(w) \leq \max\{\mathcal R(s) | s \in \mathcal U^*(S)\} = \mathcal Q(\mathcal U^*(S)). \end{equation} $$ If $w$ lies strictly below $\mathcal U(S)$, then the inequality is strict: $$ \begin{equation} \mathcal R(w) < \mathcal Q(\mathcal U^*(S)). \end{equation} $$

Having established this proposition, the identification of CPA functions with upper convex hulls is a corollary:

(CPAs as Upper Convex Hulls, Corollary 3.3.3 in Nazari) Every CPA function $\mathcal Q(S)$ can be uniquely represented as an upper convex hull in dual space. That is, $\mathcal Q(S) = \mathcal Q(\mathcal U^*(S))$.

A visualization of Corollary can be found in Figure.

Dual Representation of Neural Networks

Using the above constructions, we will now establish a connection between fully connected feedforward networks and DCPA functions. This will ultimately enable us to translate the network to dual space.

Neural Networks and Affine Geometry

In order to establish the connection, we first need to develop some more machinery, beginning with the definition of how to sum two sets:

Given two non-empty sets $X, Y \subseteq \mathbb R^{d+1}$, we define $$ \begin{equation*} X \style{transform: rotate(45deg); display: inline-block;}{\boxtimes} Y := \{\mathbf x + \mathbf y | \mathbf x \in X, \mathbf y \in Y\} \end{equation*} $$ to be the Minkowski sum of $X$ and $Y$. We define $X \style{transform: rotate(45deg); display: inline-block;}{\boxtimes} \emptyset := X$. On vectors of sets of dual points, we define $\style{transform: rotate(45deg); display: inline-block;}{\boxtimes}$ to act component-wise.
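A minimal sketch of the Minkowski sum for finite sets of dual points, represented here as Python sets of tuples (my own representation; treating an empty set as the identity on either side is a convenience matching the convention above):

```python
def minkowski_sum(X, Y):
    """Minkowski sum of two finite sets of points (tuples); empty sets act as identity."""
    if not X:
        return set(Y)
    if not Y:
        return set(X)
    return {tuple(xi + yi for xi, yi in zip(x, y)) for x in X for y in Y}

print(minkowski_sum({(1, 0, 0)}, {(0, 1, 0)}))   # {(1, 1, 0)}
```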

Next, we list some properties of the operator $\mathcal Q$ (see Definition), which assigns to a set of dual points the corresponding CPA function:

(Properties of $\mathcal Q$, Lemma 5.1.2 in Nazari) For any two finite sets of points $X, Y \subseteq \mathcal D$ and every non-negative scalar $\alpha \geq 0$, the following are true:
  1. $\mathcal Q(X \cup Y) = \max\{\mathcal Q(X), \mathcal Q(Y)\}$
  2. $\mathcal Q(X \style{transform: rotate(45deg); display: inline-block;}{\boxtimes} Y) = \mathcal Q(X) + \mathcal Q(Y)$
  3. $\alpha \cdot \mathcal Q(X) = \mathcal Q(\alpha \cdot X)$, where the multiplication on the right-hand side is the natural multiplication of a set with a real number.

Neural networks rely heavily on matrix multiplications. On our mission to translate them to dual space, we must establish the concept of matrix multiplication in the dual setting:

We define the multiplication of an $m \times n$ matrix $\mathbf A$ with a vector $X$ of $n$ finite sets of dual points as $$ \begin{align*} \cdot \colon \mathbb R^{m, n} \times \left(P_\text{fin}(\mathcal D)\right)^n &\to \left(P_\text{fin}(\mathcal D)\right)^m\\ (\mathbf A, X) &\mapsto \mathbf A \cdot X \end{align*} $$ where $$ (\mathbf A \cdot X)_i := {\style{transform: rotate(45deg); display: inline-block; font-size: 150%;}{\boxtimes}}_{j=1}^{n} \mathbf A_{ij} \cdot X_j \quad \forall i=1,\ldots,m. $$ In the notation above, $\left(P_\text{fin}(\mathcal D)\right)^n$ denotes the $n$-fold Cartesian product of $P_\text{fin}(\mathcal D)$ with itself and ${\style{transform: rotate(45deg); display: inline-block; font-size: 150%;}{\boxtimes}}_{j=1}^n$ denotes the Minkowski sum over the sets indexed by $\{1,\ldots,n\}$.
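The dual matrix action can be sketched as follows. This snippet reuses `minkowski_sum` from the sketch above (so it is not standalone); the helper names `scale` and `dual_matmul` are mine.

```python
import numpy as np
# Reuses minkowski_sum from the sketch above.

def scale(alpha, X):
    """Multiply every point of the finite set X by the scalar alpha."""
    return {tuple(float(alpha) * float(xi) for xi in x) for x in X}

def dual_matmul(A, X):
    """Dual matrix action: (A . X)_i is the Minkowski sum over j of A_ij * X_j.

    A is an (m, n) array, X a list of n finite sets of dual points;
    the result is a list of m finite sets of dual points.
    """
    A = np.asarray(A, dtype=float)
    result = []
    for row in A:
        acc = set()   # the empty set is neutral for the Minkowski sum
        for alpha, Xj in zip(row, X):
            acc = minkowski_sum(acc, scale(alpha, Xj))
        result.append(acc)
    return result
```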

The following lemma shows that matrix multiplication and the $\mathcal Q$-operator commute:

(Matrix Multiplication, Lemma 5.1.4 in Nazari) Let $X \in \left(P_\text{fin}(\mathcal D)\right)^n$ be a vector of finite sets of dual points and $\mathbf A \in \mathbb R_+^{m, n}$ a matrix with non-negative entries. Then $$ \begin{equation*} \mathbf A \mathcal Q(X) = \mathcal Q(\mathbf A \cdot X). \end{equation*} $$

In order to account for biases, we define how to add a scalar to a set of dual points:

A scalar can be added to a set of dual points by adding the scalar to the last entry of each point in the set: $$ \begin{align*} \boxplus \colon P_\text{fin}(\mathcal D) \times \mathbb R &\to P_\text{fin}(\mathcal D) \\ (X, \alpha) &\mapsto X \boxplus \alpha, \end{align*} $$ where $X \boxplus \alpha$ is the set $$ \begin{equation*} X \boxplus \alpha := \{(\mathbf x, y + \alpha) | (\mathbf x, y) \in X\}. \end{equation*} $$
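In the same tuple representation as above, $\boxplus$ is a one-liner (again my own helper name):

```python
def boxplus(X, alpha):
    """Add the scalar alpha to the last coordinate of every dual point in X."""
    return {x[:-1] + (x[-1] + alpha,) for x in X}

print(boxplus({(1, 0, 0), (0, 1, 0)}, 2))   # {(1, 0, 2), (0, 1, 2)}
```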

It turns out that $\mathcal Q$ is also well behaved with respect to $\boxplus$:

(Lemma 5.1.6 in Nazari) For any finite set of dual points $X \subseteq \mathcal D$ and scalar $\alpha \in \mathbb R$, it holds that $$ \begin{equation*} \mathcal Q(X) + \alpha = \mathcal Q(X \boxplus \alpha). \end{equation*} $$

We are now ready to present the following fundamental proposition, which establishes the connection between ReLU networks and DCPA functions:

(Dual Representation, Proposition 5.1.7 in Nazari) Assume that a neural network in the sense of Definition can, up to layer $l-1$, be written as a DCPA function ${\mathbf a}_{l-1} = \mathcal Q(P_{l-1})- \mathcal Q(N_{l-1})$ for some vectors of finite sets of dual points $P_{l-1}, N_{l-1}$. Then, after writing ${\mathbf W}_l = {\mathbf W}_l^+ - {\mathbf W}_l^-$ using matrices ${\mathbf W}_l^+$ and ${\mathbf W}_l^-$ with non-negative entries, also the network up to the $l$'th layer can be written as a DCPA function $$ \begin{equation*} {\mathbf a}_l = \mathcal Q(P_l) - \mathcal Q(N_l) \end{equation*} $$ with $$ \begin{align*} N_l &= ({\mathbf W}_l^- \cdot P_{l-1}) \style{transform: rotate(45deg); display: inline-block;}{\boxtimes} ({\mathbf W}_l^+ \cdot N_{l-1}) \\ P_l &= \left( \left(({\mathbf W}_l^+ \cdot P_{l-1}) \style{transform: rotate(45deg); display: inline-block;}{\boxtimes} ({\mathbf W}_l^- \cdot N_{l-1})\right) \boxplus {\mathbf b}_l \right) \cup \begin{cases} N_l \boxplus t_l, & t_l \neq - \infty \\ \emptyset, & t_l = -\infty. \end{cases} \end{align*} $$
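The recursion of the proposition translates almost verbatim into code. The sketch below reuses `minkowski_sum`, `dual_matmul` and `boxplus` from the sketches above; splitting the weights with `np.maximum` and the name `layer_update` are my own choices.

```python
import numpy as np
# Reuses minkowski_sum, dual_matmul and boxplus from the sketches above.

def layer_update(P_prev, N_prev, W, b, t=0.0):
    """One layer of the dual-representation recursion.

    P_prev, N_prev: lists of finite sets of dual points (one set per neuron
    of layer l-1).  W, b: weights and bias of layer l.  t: threshold of the
    activation at layer l (t = -inf corresponds to a linear layer).
    Returns the pair (P_l, N_l).
    """
    W = np.asarray(W, dtype=float)
    W_plus, W_minus = np.maximum(W, 0.0), np.maximum(-W, 0.0)   # W = W+ - W-

    # N_l = (W- . P_{l-1})  Minkowski-sum  (W+ . N_{l-1})
    N = [minkowski_sum(x, y)
         for x, y in zip(dual_matmul(W_minus, P_prev), dual_matmul(W_plus, N_prev))]

    # ((W+ . P_{l-1})  Minkowski-sum  (W- . N_{l-1}))  boxplus  b_l
    pre = [boxplus(minkowski_sum(x, y), float(beta))
           for x, y, beta in zip(dual_matmul(W_plus, P_prev),
                                 dual_matmul(W_minus, N_prev),
                                 np.atleast_1d(b))]

    if t == -np.inf:     # linear layer: the case t = -inf contributes nothing
        P = pre
    else:                # thresholded layer: take the union with N_l boxplus t
        P = [p | boxplus(n, t) for p, n in zip(pre, N)]
    return P, N
```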

The following corollary makes sure Proposition can actually be applied to ReLU networks by establishing the base case:

Every neural network $\mathcal N$ in the sense of Definition can be written as a DCPA function $$ \begin{equation*} \mathcal N = \mathcal Q(P) - \mathcal Q(N) \end{equation*} $$ for some vectors of sets of dual points $P, N \subseteq \mathcal D$. We call $(P, N)$ the dual representation of $\mathcal N$.

The dual representation is denoted by $(P, N)$ since those two sets tell us which points are classified positively and negatively:

(Positive and negative samples, Proposition 5.1.11 in Nazari) Let $\mathcal N = \mathcal Q(P) - \mathcal Q(N)$ be a ReLU binary classification network. Then the following are true for an input $\mathbf x \in \mathbb R^d$: $$ \begin{align} \mathcal N(\mathbf x) \geq 0 &\iff \mathcal Q(P \cup N)(\mathbf x) = \mathcal Q(P)(\mathbf x) \\ \mathcal N(\mathbf x) \leq 0 &\iff \mathcal Q(P \cup N)(\mathbf x) = \mathcal Q(N)(\mathbf x). \end{align} $$
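As a tiny sanity check of this equivalence (my own toy illustration, reusing the `Q` helper sketched earlier), take the DCPA function $\mathcal Q(\{(1,0)\}) - \mathcal Q(\{(0,0)\})$, i.e., the identity on $\mathbb R$:

```python
# Reuses Q from the earlier sketch.
P, N = {(1., 0.)}, {(0., 0.)}            # Q(P)(x) = x, Q(N)(x) = 0
net = lambda x: Q(P)(x) - Q(N)(x)        # the DCPA function x -> x

for x in ([2.0], [-3.0]):
    print(x, net(x) >= 0, Q(P | N)(x) == Q(P)(x), Q(P | N)(x) == Q(N)(x))
# [2.0]  True  True  False   (positive input: the max over P u N is attained by P)
# [-3.0] False False True    (negative input: the max over P u N is attained by N)
```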

The following remark summarizes the dual representation and why it is relevant.

Proposition and Corollary are important tools throughout the rest of this series. We want to use this remark to highlight their significance. Let $\mathcal N \colon \mathbb R^d \to \mathbb R$ be a ReLU network with $L$ layers.
  1. Given the weights of $\mathcal N$, the dual representation provides a symbolic representation $(P_l, N_l)$ of $\mathcal N$ up to layer $l$.
  2. It is given by two $n_l$-dimensional vectors $P_l, N_l$ of finite sets of dual points, where $n_l$ is the width of layer $l$. The dual points are $(d+1)$-dimensional.
  3. After each layer $l$, the sets of dual points can be replaced by their upper convex hull (see Corollary). In particular, for every $i \in \{1,\ldots,n_l\}$, the set $(P_l)_i \subseteq \mathcal D = \mathbb R^{d+1}$ can be replaced by its upper convex hull vertices $\mathcal U^*((P_l)_i)$. The same holds for $(N_l)_i$.
  4. As we will see later, this symbolic representation allows counting the number of affine regions defined by $\mathcal N$. In the case of binary classification, it furthermore allows counting the linear pieces in the decision boundary (Post 2).

Scattered throughout this series, we employ a running example to highlight the above points.

Example

In this example, we construct the dual representation of a toy example in two dimensions (see Example 5.2.6 in Nazari). Throughout this series, we will revisit and use this example to explain various aspects of the discussed duality result.

Subfigure (a) shows the affine regions of the network defined in Equation \eqref{eq:tropical-toy-example}. Subfigure (b) shows its decision boundary. Negatively classified regions (threshold at $0$) are colored red and positively classified regions are colored blue.

Specifically, consider the $3$-layer network

\[\begin{align} \label{eq:tropical-toy-example} \mathcal N \colon \mathbb R^2 &\to \mathbb R \\ {\mathcal N}(\mathbf x) &= {\mathbf W}_3\rho_0\left({\mathbf W}_2\rho_0\left({\mathbf W}_1 \mathbf x + {\mathbf b}_1\right) + {\mathbf b}_2\right) + b_3 \end{align}\]

where

\[\begin{align*} {\mathbf W}_1 &= \begin{pmatrix} -1 & -1 \\ 1 & -2 \end{pmatrix}, {\mathbf b}_1 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}, {\mathbf W}_2 = \begin{pmatrix} -1 & 2 \\ 2 & -1 \end{pmatrix}, {\mathbf b}_2 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, {\mathbf W}_3 = \begin{pmatrix} 3 & -1 \end{pmatrix}, b_3 = 2. \end{align*}\]

We use Proposition to iteratively construct the dual representation of $\mathcal N$, starting with $P_0 = (\{(1,0,0)\}, \{(0,1,0)\})$ and $N_0 = (\emptyset, \emptyset)$, as in the proof of Corollary.

After the first layer, the dual representation of $N_1$ can be computed as

\[\begin{align*} N_1 &= {\mathbf W}_1^- P_0 \style{transform: rotate(45deg); display: inline-block;}{\boxtimes} {\mathbf W}_1^+ N_0 = \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix} P_0 \end{align*}\]

and thus

\[\begin{align*} (N_1)_1 &= {\style{transform: rotate(45deg); display: inline-block; font-size: 150%;}{\boxtimes}}_{j=1}^2 ({\mathbf W}_1^-)_{1j} (P_0)_j = 1\left\{\left(1,0,0\right)\right\} \style{transform: rotate(45deg); display: inline-block;}{\boxtimes} 1\left\{\left(0,1,0\right)\right\} = \left\{\left(1,1,0\right)\right\} \\ (N_1)_2 &= {\style{transform: rotate(45deg); display: inline-block; font-size: 150%;}{\boxtimes}}_{j=1}^2 ({\mathbf W}_1^-)_{2j} (P_0)_j = 0\left\{\left(1,0,0\right)\right\} \style{transform: rotate(45deg); display: inline-block;}{\boxtimes} 2\left\{\left(0,1,0\right)\right\} = \left\{\left(0,2,0\right)\right\}, \end{align*}\]

which implies

\[\begin{equation*} N_1 = \left(\left\{\left(1,1,0\right)\right\}, \left\{\left(0,2,0\right)\right\}\right). \end{equation*}\]

Similarly,

\[\begin{align*} P_1 &= \left(\left({\mathbf W}_1^+ P_0 \style{transform: rotate(45deg); display: inline-block;}{\boxtimes} {\mathbf W}_1^- N_0\right) \boxplus {\mathbf b}_1\right) \cup N_1 = \left(\begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} P_0 \boxplus {\mathbf b}_1\right) \cup N_1 \end{align*}\]

and thus

\[\begin{align*} (P_1)_1 &= {\style{transform: rotate(45deg); display: inline-block; font-size: 150%;}{\boxtimes}}_{j=1}^{2} ({\mathbf W}_1^+)_{1j} (P_0)_j \boxplus ({\mathbf b}_1)_1 \cup (N_1)_1 = 0\left\{\left(1,0,0\right)\right\} \style{transform: rotate(45deg); display: inline-block;}{\boxtimes} 0\left\{\left(0,1,0\right)\right\} \boxplus 1 \cup (N_1)_1 \\ &= \left\{\left(0,0,1\right), \left(1,1,0\right)\right\} \\ (P_1)_2 &= {\style{transform: rotate(45deg); display: inline-block; font-size: 150%;}{\boxtimes}}_{j=1}^{2} ({\mathbf W}_1^+)_{2j} (P_0)_j \boxplus ({\mathbf b}_1)_2 \cup (N_1)_2 = 1\left\{\left(1,0,0\right)\right\} \style{transform: rotate(45deg); display: inline-block;}{\boxtimes} 0\left\{\left(0,1,0\right)\right\} \boxplus (-1) \cup (N_1)_2 \\ &= \left\{\left(1,0,-1\right), \left(0, 2, 0\right)\right\}, \end{align*}\]

which implies

\[\begin{equation*} P_1 = \left(\left\{\left(0,0,1\right),\left(1,1,0\right)\right\}, \left\{\left(0,2,0\right),\left(1,0,-1\right)\right\}\right). \end{equation*}\]
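These first-layer sets can be reproduced with the `layer_update` sketch from above (the snippet is therefore not standalone); the expected outputs are indicated in the comments, up to set ordering.

```python
import numpy as np
# Reuses layer_update (and its helpers) from the sketch above.

W1 = np.array([[-1., -1.], [1., -2.]])
b1 = np.array([1., -1.])

P0 = [{(1., 0., 0.)}, {(0., 1., 0.)}]   # dual points of the identity map x -> (x_1, x_2)
N0 = [set(), set()]

P1, N1 = layer_update(P0, N0, W1, b1, t=0.0)
print(N1)   # [{(1.0, 1.0, 0.0)}, {(0.0, 2.0, 0.0)}]
print(P1)   # [{(0.0, 0.0, 1.0), (1.0, 1.0, 0.0)}, {(0.0, 2.0, 0.0), (1.0, 0.0, -1.0)}]
```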

After repeating these steps for layers $2$ and $3$ (with a slight adaptation for the last linear layer as in Proposition), one arrives at the following final dual representation of $\mathcal N = \mathcal Q(P) - \mathcal Q(N)$:

\[\begin{align*} N &= \left\{(3, 17, 4), (2, 16, 5), (5, 19, 2), (3, 14,2), (2,16,3),(5,19,0),(6,17,-1),(0,14,7)\right\} \\ P &= \left\{(2,16,5),(5,19,2),(5,19,5),(11,7,-1),(12,5,-2),(3,14,4),(6,17,1),(6,17,4)\right\}. \end{align*}\]

Here, $N := N_3$ and $P := P_3$. Additionally, note that we have identified the one-dimensional vectors of sets $N_3$ and $P_3$ with their only entry.

By Corollary, the CPA functions $\mathcal Q(N)$ and $\mathcal Q(P)$ are uniquely identified by the upper convex hulls of $N$ and $P$, i.e.,

\[\begin{equation*} \mathcal Q(N) = \mathcal Q(\mathcal U^*(N)), \quad \mathcal Q(P) = \mathcal Q(\mathcal U^*(P)). \end{equation*}\]

This allows restricting our attention to subsets of $N$ and $P$. Specifically, the upper convex hull points can be determined (we use SciPy to do so; see the code for more details) as

\[\begin{align*} \mathcal U^*(N) &= \left\{(5, 19, 2), (3, 14, 2), (6, 17, -1), (0, 14, 7)\right\} \\ \mathcal U^*(P) &= \left\{(2,16,5), (3,14,4), (5,19,5), (12,5,-2)\right\}. \end{align*}\]
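One way to compute these upper convex hull vertices is via `scipy.spatial.ConvexHull`: a facet belongs to the upper hull exactly when its outward normal has a positive last component. The helper below is my own sketch of this idea, not the code used for the thesis.

```python
import numpy as np
from scipy.spatial import ConvexHull

def upper_convex_hull_vertices(points):
    """Vertices of the upper convex hull (w.r.t. the last coordinate) of a finite point set."""
    pts = np.asarray(points, dtype=float)
    hull = ConvexHull(pts)
    upper = set()
    # hull.equations stores (outward normal, offset) for each facet in hull.simplices.
    for simplex, eq in zip(hull.simplices, hull.equations):
        if eq[-2] > 1e-12:      # positive last normal component: an upper facet
            upper.update(simplex.tolist())
    return pts[sorted(upper)]

N = [(3, 17, 4), (2, 16, 5), (5, 19, 2), (3, 14, 2),
     (2, 16, 3), (5, 19, 0), (6, 17, -1), (0, 14, 7)]
print(upper_convex_hull_vertices(N))   # the four points of U*(N) above, in some order
```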

Figure contains a plot of the dual representation of this toy example, as well as the upper convex hulls.

Dual representation of the two-dimensional toy example defined in Equation \eqref{eq:tropical-toy-example}. Red points correspond to $N$, blue points are $P$. The red polygon is $\mathcal U(N)$, the blue polygon is $\mathcal U(P)$. Note that, in theory, both $\mathcal U(N)$ and $\mathcal U(P)$ are polyhedral complexes, i.e., they can consist of multiple facets.



If you found this useful, please cite this as:

Nazari, Philipp (Jun 2025). The Dual Representation of ReLU Networks. https://phnazari.github.io.

or as a BibTeX entry:

@article{nazari2025the-dual-representation-of-relu-networks,
  title   = {The Dual Representation of ReLU Networks},
  author  = {Nazari, Philipp},
  year    = {2025},
  month   = {Jun},
  url     = {https://phnazari.github.io/blog/2025/dual-representation/}
}