The main techniques used in the experiments of this article are adiabatic replay (AR), experience replay (ER), deep generative replay (DGR) and foundation models.
We refer to the appendix for details on the experimental settings concerning ER (\cref{app:er}), DGR (\cref{app:dgr}) and the encoding of data by foundation models (\cref{app:fm}), whereas we will discuss the details of AR in this section.
%
\subsection{Adiabatic replay (AR)}
%
In contrast to conventional replay, where a scholar is composed of a generator and a solver network (see \cref{fig:genrep}), AR proposes scholars in which a single network acts both as a generator and as a feature extractor feeding the solver.
Assuming a suitable scholar (see below), the high-level logic of AR is shown in \cref{fig:var}: Each sample from a new task is used to \textit{query} the scholar, which generates a similar, known sample. Mixing new and generated samples in a defined, constant proportion creates the training data for the current task.
A new sample will cause adaptation of the scholar in a localized region of data space. Variants generated by that sample will, due to similarity, cause adaptation in the same region. Knowledge in the overlap region will therefore be adapted to represent both, while dissimilar regions stay unaffected (see \cref{fig:var} for a visual impression).
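The query-and-mix logic described above can be sketched as follows. This is a minimal illustration, not the article's implementation: the method names `generate_variant` and the 1:1 mixing proportion are assumptions made for the sketch.

```python
import numpy as np

def adiabatic_replay_task(scholar, new_task_data, rng):
    """Build the training set for one AR task.

    `scholar.generate_variant` is a hypothetical method name standing
    in for the variant generation of the GMM scholar in the text.
    """
    # Selective replay: every new sample queries the scholar, which
    # returns a similar, already-known sample (a "variant").
    variants = np.stack([scholar.generate_variant(x) for x in new_task_data])
    # Mix new and generated samples in a fixed, constant proportion (here 1:1).
    mixed = np.concatenate([new_task_data, variants])
    rng.shuffle(mixed)
    return mixed
```

Because every query produces exactly one variant, the size of the merged training set is fixed by the size of the new task, independent of how many tasks were seen before.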
...
...
To reduce noise, top-S sampling is introduced, where only the $S=2$ highest values of the responsibilities are used for selection.
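Top-S sampling can be sketched in a few lines of numpy: only the $S$ components with the highest responsibilities remain eligible, and one of them is drawn in proportion to its renormalized responsibility. The function name and interface are illustrative assumptions.

```python
import numpy as np

def top_s_component(gamma, s=2, rng=None):
    """Select a GMM component index via top-S sampling.

    Only the S components with the largest responsibilities `gamma`
    are eligible; among those, selection is proportional to their
    renormalized responsibilities. S=2 follows the text.
    """
    if rng is None:
        rng = np.random.default_rng()
    top = np.argsort(gamma)[-s:]           # indices of the S largest responsibilities
    p = gamma[top] / gamma[top].sum()      # renormalize over the top-S set
    return rng.choice(top, p=p)
```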
% CLASSIFY
\par\noindent\textbf{Solver}
functions are performed by feeding GMM responsibilities into a linear regression layer as $\vo(\vx_n)=\mW\vgamma(\vx_n)$.
We use an MSE loss and drop the bias term to reduce sensitivity to unbalanced classes.
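The solver readout and its MSE-based update can be sketched as follows. This is a hedged illustration only: the article trains this layer jointly with the GMM, whereas the sketch shows an isolated gradient step; all names are assumptions.

```python
import numpy as np

def solver_output(W, gamma):
    """Bias-free linear readout o(x) = W @ gamma(x) on GMM responsibilities."""
    return W @ gamma

def mse_grad_step(W, gamma, target, lr=0.1):
    """One gradient step on the (half-)MSE loss 0.5 * ||W @ gamma - target||^2."""
    err = W @ gamma - target           # prediction error
    W -= lr * np.outer(err, gamma)     # gradient of the half-MSE w.r.t. W
    return W
```

Note the absence of a bias vector: the readout is fully determined by $\mW$, matching the bias-free formulation in the text.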
% TRAIN
\par\noindent\textbf{GMM training}
...
...
\par\noindent\textbf{SVHN}~\cite{netzer2011reading} contains 60,000 RGB images of house numbers ($0$-$9$, resolution $32$\,$\times$\,$32$).
\par\noindent\textbf{CIFAR-10}~\cite{krizhevsky2009learning} contains 60,000 RGB images of natural objects (resolution $32$\,$\times$\,$32$) in 10 balanced classes.
CL problems formed from these datasets according to the default class-incremental scenario (\cref{sec:intro}) have been shown to be far from the theoretically optimal performance, see, e.g., \cite{pfulb2019comprehensive,mir}. CL tasks typically used in the literature are D9-1, D5$^2$, D2$^5$ or D1$^{10}$. Expressions like D2$^5$ are to be read as D2-2-2-2-2.
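The task notation can be made concrete with a small helper that expands such a spec into per-task class lists. The ascending class assignment is an assumption for illustration; the article's exact class-to-task assignment may differ.

```python
def expand_tasks(spec):
    """Expand a CL task spec like 'D2-2-2-2-2' into per-task class lists.

    Classes are assigned in ascending order (illustrative assumption).
    """
    sizes = [int(s) for s in spec.lstrip('D').split('-')]
    out, nxt = [], 0
    for n in sizes:
        out.append(list(range(nxt, nxt + n)))
        nxt += n
    return out
```

For example, `expand_tasks('D9-1')` yields a first task with classes 0-8 followed by a second task containing only class 9.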
%
E-MNIST represents a CL problem where the amount of already acquired knowledge can be significantly larger than the amount of new data added with each successive task. Therefore, D20-$1^5$ is performed exclusively for E-MNIST.
No feature encoding by foundation models is performed for MNIST, Fashion-MNIST, E-MNIST and Fruits-360 due to their inherent simplicity. The encoding of SVHN and CIFAR is described in \cref{app:fm}.
Similar to \cite{kemker2018measuring, mundt2021cleva}, we provide the final (averaged) accuracy $\alpha_{T}$, evaluating a scholar $\mathcal{S}_{T}$ on a test set $T_{\text{ALL}}$ after full training on all sub tasks $t\leq T$ for any given class-incremental learning problem (CIL-P) listed in \cref{tab:slts}. The values are normalized to a range of $\alpha\in[0,1]$. The test set contains previously unseen data samples from all encountered classes. In addition, we also showcase a baseline measure $\alpha^\text{base}$, highlighting the performance of each scholar in a non-continual setting, learning all classes jointly. \\\\
%
Furthermore, we report a forgetting measure $F_{i}^{j}$, defined for task $i$ after training $\mathcal{S}$ on task $j$. It reflects the loss of knowledge about the previous task $i$ and highlights the degradation compared to the peak performance of $\mathcal{S}$ on exactly that task:
\begin{equation}
F_{i}^{j} = \max_{l\in\{1,\dots,j-1\}}\alpha_{i,l} - \alpha_{i,j}\qquad\forall\, i < j.
\end{equation}
Average forgetting at the final task $T$ is then defined as: $F_T =\frac{1}{T-1}\sum^{T-1}_{j=1} F^{T}_{j}$.
%
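One way to compute these quantities is from a matrix of accuracies $\alpha_{i,j}$ (accuracy on task $i$ after training stage $j$). The sketch below uses 0-based indices and reads the forgetting measure as peak past performance minus current performance; the function names are assumptions.

```python
import numpy as np

def forgetting(acc, i, j):
    """Forgetting of task i after training stage j (0-indexed, j >= 1).

    acc[i, l] = accuracy on task i after training on stage l.
    Peak past performance minus current performance on task i.
    """
    return acc[i, :j].max() - acc[i, j]

def average_forgetting(acc):
    """Average forgetting over all previous tasks at the final stage T."""
    T = acc.shape[1] - 1
    return np.mean([forgetting(acc, i, T) for i in range(T)])
```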
Training consists of an (initial) run on $T_1$, followed by a sequence of independent (replay) runs on $T_{i>1}$.
% Averaged over runs & baseline experiments
We perform ten randomly initialized runs for each CIL-P, and conduct baseline experiments for all datasets to measure the offline joint-class training performance.
\\
%
\begin{table}[h!]
\scriptsize
\renewcommand{\arraystretch}{.9}
...
...
\label{tab:slts}
}
\end{table}
We set the training mini-batch size to $\beta=100$ ($\beta=50$ for the Fruits dataset). Selective replay of $D_i$ samples is performed before training on task $T_{i}, i>1$, using the current scholar $\mathcal{S}_{i-1}$, where $D_i$ denotes the number of training samples contained in $T_i$.
This strategy keeps the number of generated samples constant w.r.t.\ the number of tasks, and thus comes with modest temporary storage requirements rather than growing linearly with the number of incoming tasks.
When replaying, mini-batches of $\beta$ samples are randomly drawn, in equal proportions, from the real samples from task $T_i$ and the generated samples representing previous tasks.
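The per-batch mixing can be sketched as follows; the function name and interface are assumptions for illustration.

```python
import numpy as np

def replay_minibatch(real, generated, beta=100, rng=None):
    """Draw one mini-batch of beta samples, half from the real data of
    the current task and half from the generated samples representing
    previous tasks. A sketch of the mixing scheme in the text."""
    if rng is None:
        rng = np.random.default_rng()
    half = beta // 2
    r = real[rng.choice(len(real), size=half)]
    g = generated[rng.choice(len(generated), size=half)]
    return np.concatenate([r, g])
```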
%For training, mini-batches are randomly drawn from this resulting merged subset $\mathcal{D}_{T_i}$.
%
It is worth noting that classes will, in general, \textit{not} be balanced in the merged generated/real data at $T_i$, and that it is not required to store the statistics of previously encountered class instances/labels.
\caption{\label{fig:vargen} An example for variant generation in AR, see \cref{sec:approach} and \cref{fig:var} for details. Left: centroids of the current GMM scholar trained on MNIST classes 0, 4 and 6. Middle: query samples of MNIST class 9. Right: variants generated in response to the query. Component weights and variances are not shown.
}
\end{figure}
First, we demonstrate the ability of a trained GMM to query its internal representation through data samples and selectively generate artificial data that \enquote{best match} those defining the query. To illustrate this, we train a GMM layer of $K=25$ components on MNIST classes 0, 4 and 6 for 50 epochs using the best-practice rules described in \cref{app:ar}. Then, we query the trained GMM with samples from class 9 only, as described in \cref{sec:gmm}. The resulting samples are all from class 4, since this is the class \enquote{most similar} to the query class. These results are visualized in \cref{fig:vargen}. Variant generation results for deep convolutional extensions of GMMs can be found in \cite{gepperth2021new}, emphasizing that the AR approach can be scaled to more complex problems.