From c47c5c9e4f206e77db2a93ad8647c2569c6d7e1c Mon Sep 17 00:00:00 2001 From: fdai0234 <alexander.krawczyk@informatik.hs-fulda.de> Date: Thu, 28 Sep 2023 10:21:28 +0200 Subject: [PATCH] minor fixes --- iclr2024_conference.tex | 42 +++++++++++++++++++---------------------- 1 file changed, 19 insertions(+), 23 deletions(-) diff --git a/iclr2024_conference.tex b/iclr2024_conference.tex index 0297167..9883ff3 100644 --- a/iclr2024_conference.tex +++ b/iclr2024_conference.tex @@ -158,11 +158,11 @@ \end{figure} % The main techniques used in the experiments of this article are adiabatic replay (AR), experience replay (ER), deep generative replay (DGR) and foundation models. - We refer to the appendix for details and experimental settings concerning ER (\cref{app:er}), DGR (\cref{app:dgr}) and the encoding of data by foundation models (\cref{app:fm}), whereas we will discuss the details of AR in this section. + We refer to the appendix for details on the experimental settings concerning ER (\cref{app:er}), DGR (\cref{app:dgr}) and the encoding of data by foundation models (\cref{app:fm}), whereas we will discuss the details of AR in this section. % - \subsection{Adiabatic replay(AR)} - % - In contrast to conventional replay, where a scholar is composed of a generator and a solver network, see \cref{fig:genrep}, AR proposes scholars where a single network acts as a geneator as well as a feature generator for the solver. + \subsection{Adiabatic replay (AR)} + % TODO: "as a generator as well as a feature generator for the solver" could be somewhat unclear! + In contrast to conventional replay, where a scholar is composed of a generator and a solver network, see \cref{fig:genrep}, AR proposes scholars where a single network acts as a generator as well as a feature generator for the solver. 
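The query-and-mix logic of an AR scholar can be sketched as follows. This is a minimal illustration under stated assumptions: `ToyScholar`, its nearest-centroid "generation", and `build_task_data` are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

class ToyScholar:
    """Stand-in for an AR scholar (illustrative only): 'generates' a known
    sample similar to a query by returning the nearest stored centroid."""
    def __init__(self, centroids):
        self.centroids = np.asarray(centroids, dtype=float)

    def generate_similar(self, x):
        # pick the stored prototype closest to the query sample
        d = np.linalg.norm(self.centroids - np.asarray(x, dtype=float), axis=1)
        return self.centroids[d.argmin()]

def build_task_data(scholar, new_samples, rng=None):
    """Mix new samples with scholar-generated look-alikes in equal
    proportion to form the training data for the current task."""
    rng = rng or np.random.default_rng(0)
    new_samples = np.asarray(new_samples, dtype=float)
    generated = np.stack([scholar.generate_similar(x) for x in new_samples])
    mixed = np.concatenate([new_samples, generated])
    rng.shuffle(mixed)  # shuffles rows in place
    return mixed
```

Because each generated variant is similar to the query that produced it, training on the mixed set adapts the scholar only in the affected region of data space, which is the "adiabatic" property motivated above.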
Assuming a suitable scholar (see below), the high-level logic of AR is shown in \cref{fig:var}: Each sample from a new task is used to \textit{query} the scholar, which generates a similar, known sample. Mixing new and generated samples in a defined, constant proportion creates the training data for the current task. A new sample will cause adaptation of the scholar in a localized region of data space. Variants generated by that sample will, due to similarity, cause adaptation in the same region. Knowledge in the overlap region will therefore be adapted to represent both, while dissimilar regions stay unaffected (see \cref{fig:var} for a visual impression). @@ -171,7 +171,7 @@ % GMMs \par\noindent\textbf{Selective updating} is an intrinsic property of GMMs. They describe data distributions by a set of $K$ \textit{components}, consisting of component weights $\pi_k$, centroids $\vmu_k$ and covariance matrices $\mSigma_k$. A data sample $\vx$ is assigned a probability $p(\vx) = \sum_k \pi_k \mathcal N(\vx ; \vmu_k, \mSigma_k)$ as a weighted sum of normal distributions $\mathcal N(\vx; \vmu_k, \mSigma_k)$. Training of GMMs is performed as detailed in \cite{gepperth2021gradient} by adapting centroids, covariance matrices and component weights through the SGD-based minimization of the negative log-likelihood $\mathcal L =\sum_n \log \sum_k \pi_k \mathcal N(\vx_n; \vmu_k,\mSigma_k)$. - As shown in \cite{gepperth2021gradient}, this expression is strongly dominated by a single GMM component $k^*$, and can be approximated as $-\log (\pi_{k^*} \mathcal N(\vx; \vmu_{k^*}, \mSigma_{k^*}))$. This implies that the best-matching GMM component $k^*$ is the only component that selectively adapted. + As shown in \cite{gepperth2021gradient}, this expression is strongly dominated by a single GMM component $k^*$, and can be approximated as $-\log (\pi_{k^*} \mathcal N(\vx; \vmu_{k^*}, \mSigma_{k^*}))$. 
This implies that the best-matching GMM component $k^*$ is the only component that is selectively adapted. % VARIANTS \par\noindent\textbf{Selective replay} is a form of sampling from the probability density represented by a trained GMM, see \cite{gepperth2021image}. @@ -183,7 +183,7 @@ To reduce noise, top-S sampling is introduced, where only the $S=2$ highest values of the responsibilities are used for selection. % CLASSIFY \par\noindent\textbf{Solver} - functions are performed by feeding GMM responsibilities into a bias-free, linear regression layer as $\vo(\vx_n) = \mW\vgamma(\vx_n)$. + functions are performed by feeding GMM responsibilities into a linear regression layer as $\vo(\vx_n) = \mW\vgamma(\vx_n)$. We use an MSE loss and drop the bias term to reduce the sensitivity to unbalanced classes. % TRAIN \par\noindent\textbf{GMM training} @@ -198,23 +198,22 @@ \par\noindent\textbf{Fruits-360}~\cite{fruits} contains 100x100 images showing different types of fruits, from which we chose the 10 best-represented classes and downsample to 32x32 RGB. \par\noindent\textbf{SVHN}~\cite{netzer2011reading} contains 60,000 RGB images of house numbers ($0$-$9$, resolution $32$\,$\times$\,$32$). \par\noindent\textbf{CIFAR-10}~\cite{krizhevsky2009learning} contains 60,000 RGB images of natural objects, resolution 32x32, in 10 balanced classes. - - CL problems formed from these datasets according to the default class-incremental scenario (\cref{sec:intro}) have been shown to be far from the theoretically optimal performance, see, e.g., \cite{pfulb2019comprehensive,mir}. CL tasks formed used in the literature are D9-1, D5$^2$, D2$^5$ or D1$^{10}$. Expressions like D$2^5$ are to be read as D2-2-2-2-2. + + CL problems formed from these datasets according to the default class-incremental scenario (\cref{sec:intro}) have been shown to be far from the theoretically optimal performance, see, e.g., \cite{pfulb2019comprehensive,mir}. 
CL tasks typically used in the literature are D9-1, D5$^2$, D2$^5$ or D1$^{10}$. Expressions like D$2^5$ are to be read as D2-2-2-2-2. % - E-MNIST represents a CL problem where the amount of already acquired knowledge can be significantly larger than the amount of new data added with each successive task. Therefore, D20-$1^5$ is performed exclusively for E-MNIST. + E-MNIST is intended to represent a CL problem where the amount of already acquired knowledge can be significantly larger than the amount of new data added with each successive task. Therefore, D20-$1^5$ is performed exclusively for E-MNIST. - No feature encoding by foundation models is performed for MNIST, FashionMNIST, E-MNIST and Fruits-360 due to their inherent simplicity. The encoding of SVHN and CIFAR is described in \cref{app:fm}. + No feature encoding by foundation models is performed for MNIST, Fashion-MNIST, E-MNIST and Fruits-360 due to their inherent simplicity. The encoding of SVHN and CIFAR is described in \cref{app:fm}. % %------------------------------------------------------------------------- \subsection{Evaluation measures}\label{sec:exppeval} - Similar to \cite{kemker2018measuring, mundt2021cleva}, we provide the final (averaged) accuracy $\alpha_{T}$, evaluating a scholar $\mathcal{M}_{T}$ on a final test set $T_{\text{ALL}}$ after full training on each sub task $t$ for any given class-incremental learning problem (CIL-P) listed in \cref{tab:slts}. The values are normalized to a range of $\alpha \in [0,1]$. The test set contains previously unseen data samples from all encountered classes. In addition, we also showcase a baseline measure $\alpha^\text{base}$, highlighting the performance of each scholar in a non-continual setting, learning all classes jointly. 
\\ \\ + Similar to \cite{kemker2018measuring, mundt2021cleva}, we provide the final (averaged) accuracy $\alpha_{T}$, evaluating a scholar $\mathcal{S}_{T}$ on a test set $T_{\text{ALL}}$ after full training on all sub-tasks $t \leq T$ for any given class-incremental learning problem (CIL-P) listed in \cref{tab:slts}. The values are normalized to a range of $\alpha \in [0,1]$. The test set contains previously unseen data samples from all encountered classes. In addition, we also showcase a baseline measure $\alpha^\text{base}$, highlighting the performance of each scholar in a non-continual setting, learning all classes jointly. \\ \\ % - Furthermore, we demonstrate a forgetting measure $F_{i}^{j}$, defined for task $i$ after training $\mathcal{M}$ on $j$. This shall reflect the loss of knowledge about previous task $i$ and highlights the degradation compared to the peak performance of $\mathcal{M}$ on exactly that task: + Furthermore, we report a forgetting measure $F_{i}^{j}$, defined for task $i$ after training $\mathcal{S}$ on $j$. It reflects the loss of knowledge about a previous task $i$ and highlights the degradation relative to the peak performance of $\mathcal{S}$ on that task, where $\alpha_{l,i}$ denotes the accuracy on task $i$ after training on task $l$: \begin{equation} - F_{i}^{j} = \max_{i\in\{1,..,t-1\}} \alpha_{i,j} - \alpha_{t,j} \qquad \forall j < t. + F_{i}^{j} = \max_{l\in\{1,..,j-1\}} \alpha_{l,i} - \alpha_{j,i} \qquad \forall i < j. \end{equation} - Average forgetting $F_t$ is then defined as: $F_t = \frac{1}{T-1} \sum^{T-1}_{j=1} F^{t}_{j}$. - % + Average forgetting $F_T$ is then defined as: $F_T = \frac{1}{T-1} \sum^{T-1}_{j=1} F^{T}_{j}$. % %------------------------------------------------------------------------- \subsection{Experimental Setting} @@ -229,7 +228,7 @@ Training consists of an (initial) run on $T_1$, followed by a sequence of independent (replay) runs on $T_{i>1}$. 
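One consistent reading of this forgetting measure can be sketched numerically. The function name `average_forgetting` and the accuracy-matrix layout (rows: training stage, columns: task) are assumptions of this sketch, not taken from the paper.

```python
import numpy as np

def average_forgetting(acc):
    """Average forgetting after the final task.

    acc[l, i] = accuracy on task i, evaluated after training on task l.
    Forgetting for task i is the drop from the best accuracy ever reached
    on i before the final stage to the accuracy on i at the final stage.
    """
    acc = np.asarray(acc, dtype=float)
    T = acc.shape[0]
    per_task = [acc[:T - 1, i].max() - acc[T - 1, i] for i in range(T - 1)]
    return float(np.mean(per_task))
```

With a 3-task accuracy matrix, the measure averages the per-task peak-minus-final drops over the first `T-1` tasks; a value of 0 means no degradation relative to peak performance.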
% Averaged over runs & baseline experiments We perform ten randomly initialized runs for each CIL-Problem, and conduct baseline experiments for all datasets to measure the offline joint-class training performance. - \\ + % \begin{table}[h!] \scriptsize \renewcommand{\arraystretch}{.9} @@ -264,15 +263,12 @@ \label{tab:slts} } \end{table} - % TODO: scholars are not M? - We set the training mini-batch size to $\beta=100$ ($\beta=50$ for the Fruits dataset). Selective replay of $D_i$ samples is performed before training on task $T_{i}, i>1$ using the current scholar $S_{i-1}$, where $D_i$ represents the amount of training samples contained $T_i$. - This strategy keeps the number of generated samples constant w.r.t the number of tasks, and thus comes with modest temporary storage requirements instead of growing linearly with an increasing amount of incoming tasks. + We set the training mini-batch size to $\beta=100$ ($\beta=50$ for the Fruits dataset). Selective replay of $D_i$ samples is performed before training on task $T_{i}, i>1$ using the current scholar $S_{i-1}$, where $D_i$ represents the number of training samples contained in $T_i$. + This strategy keeps the number of generated samples constant w.r.t.\ the number of tasks, and thus comes with modest temporary storage requirements rather than ones growing linearly with the number of incoming tasks. When replaying, mini-batches of $\beta$ samples are randomly drawn, in equal proportions, from the real samples from task $T_i$ and the generated samples representing previous tasks. %For training, mini-batches are randomly drawn from this resulting merged subset $\mathcal{D}_{T_i}$. % - - It is worth noting that classes will, in general, \textit{not} be balanced in the merged generated/real data at $T_i$, and that it is not required to store the statictics of previously encountered class instances/labels. 
+ It is worth noting that classes will, in general, \textit{not} be balanced in the merged generated/real data at $T_i$, and that it is not required to store the statistics of previously encountered class instances/labels. %------------------------------------------------------------------------- \subsection{Selective replay functionality} % @@ -287,7 +283,7 @@ \caption{\label{fig:vargen} An example for variant generation in AR, see \cref{sec:approach} and \cref{fig:var} for details. Left: centroids of the current GMM scholar trained on MNIST classes 0, 4 and 6. Middle: query samples of MNIST class 9. Right: variants generated in response to the query. Component weights and variances are not shown. } \end{figure} - First, we demonstrate the ability of a trained GMM to query its internal representation through data samples and selectively generate artificial data that \enquote{best match} those that define the query. To illustrate this, we train a GMM layer of $K=25$ components on MNIST classes 0,4 and 6 for 50 epochs using the best-practice rules described in \cref{app:ar}. Then, we query the trained GMM with samples from class 9 uniquely, as described in \cref{sec:gmm}. The resulting samples are all from class 4, since it is the class that is \enquote{most similar} to the query class. These results are visualized in \cref{fig:var}. Variant generation results for deep convolutional extensions of GMMs can be found in \cite{gepperth2021new}, emphasizing that the AR approach can be scaled to more complex problems. + First, we demonstrate the ability of a trained GMM to query its internal representation through data samples and selectively generate artificial data that \enquote{best match} those defining the query. To illustrate this, we train a GMM layer of $K=25$ components on MNIST classes 0, 4 and 6 for 50 epochs using the best-practice rules described in \cref{app:ar}. 
Then, we query the trained GMM exclusively with samples from class 9, as described in \cref{sec:gmm}. The resulting samples are all from class 4, since it is the class that is \enquote{most similar} to the query class. These results are visualized in \cref{fig:vargen}. Variant generation results for deep convolutional extensions of GMMs can be found in \cite{gepperth2021new}, emphasizing that the AR approach can be scaled to more complex problems. %------------------------------------------------------------------------- \subsection{Comparison: AR, ER and DGR-VAE} % BASELINE FOR RAW PIXEL/DATA INPUT
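The selective-replay mechanism demonstrated above (responsibilities followed by top-S sampling) can be sketched as follows. This is a hedged illustration, assuming diagonal covariances for simplicity; the function names and array layouts are hypothetical, not the paper's code.

```python
import numpy as np

def responsibilities(x, pis, mus, vars_):
    """Posterior gamma_k(x), proportional to pi_k * N(x; mu_k, Sigma_k),
    with diagonal covariances vars_ (an assumption of this sketch)."""
    x = np.asarray(x, dtype=float)
    log_p = (np.log(pis)
             - 0.5 * np.sum(np.log(2.0 * np.pi * vars_), axis=1)
             - 0.5 * np.sum((x - mus) ** 2 / vars_, axis=1))
    p = np.exp(log_p - log_p.max())  # subtract max for numerical stability
    return p / p.sum()

def selective_replay(query, pis, mus, vars_, S=2, rng=None):
    """Variant generation: keep only the S highest responsibilities
    (top-S sampling), pick one of those components in proportion to its
    renormalized responsibility, then sample from that Gaussian."""
    rng = rng or np.random.default_rng(0)
    gamma = responsibilities(query, pis, mus, vars_)
    top = np.argsort(gamma)[-S:]                    # indices of S best matches
    k = rng.choice(top, p=gamma[top] / gamma[top].sum())
    return rng.normal(mus[k], np.sqrt(vars_[k]))    # draw a variant near mu_k
```

Because the responsibilities are sharply peaked on the best-matching component, a query from an unseen class yields variants from the most similar known class, as in the class-9-to-class-4 example above.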