Commit afce1bbd authored by Alexander Gepperth

--

parent 186065a4
@@ -18,4 +18,4 @@ ax.tick_params(labelsize=15)
ax.grid()
plt.tight_layout() ;
plt.savefig("replay.png") ;
plt.show() ;
@@ -75,8 +75,8 @@
\begin{document}
\maketitle
\begin{abstract}
To avoid catastrophic forgetting, many replay-based approaches to continual learning (CL) require, for each learning phase with new data, the replay of samples representing \textit{all} of the previously learned knowledge. Since this knowledge grows over time, such approaches invest linearly growing computational resources just for re-learning what is already known. In this proof-of-concept study, we propose a generative replay-based CL strategy that we term adiabatic replay (AR), which achieves CL in constant time and memory complexity by making use of the (very common) situation where each new learning phase is \textit{adiabatic}, i.e., represents only a small addition to existing knowledge.
The employed Gaussian Mixture Models (GMMs) are capable of \textit{selective updating}: only those parts of their internal representation that are affected by the new task are modified. The information that would otherwise be overwritten by such updates is protected by \textit{selective replay} of samples that are similar to newly arriving ones. Thus, the number of replayed samples does not depend on the accumulated knowledge at all, but only on the added knowledge, which is small by construction. Based on the challenging CIFAR, SVHN and Fruits datasets in combination with pre-trained feature extractors, we confirm AR's superior scaling behavior while showing better accuracy than common baselines in the field.
\end{abstract}
%
\section{Introduction}\label{sec:intro}
@@ -99,7 +99,7 @@
% context: GR %
A very promising approach to mitigating catastrophic forgetting (CF) in this scenario is the use of replay strategies \cite{van2020brain}. \textit{Replay} aims at preventing CF by using samples from previous tasks to augment the current one. On the one hand, there are \enquote{true} replay methods which use a small number of stored samples for augmentation. On the other hand, there are \textit{pseudo-replay} methods, where the samples to augment the current task are produced in unlimited number by a generator, which removes the need to store samples. A schematic of the training process in generative replay is given in \cref{fig:genrep}.
% problems with CL
Replay, in its original formulation, is a principled approach to CL, but it nevertheless presents several challenges:
%
First of all, if DNNs are employed as solvers and generators, then all classes must be represented in equal proportion in every task in order to have any kind of performance guarantees.
%
@@ -113,8 +113,13 @@
%A related issue is the fact that DNN solvers and generators must, in order to have any kind of performance guarantees, be re-trained \textit{from scratch} at every task. Although empirically
%In addition, we are at the mercy of the generator to produce samples in equal frequencies, although this can be mitigated somewhat by class-conditional generation of samples \cite{lesort2019marginal}.
%-------------------------------------------------------------------------
%TODO: constant-time replay
%----------------
\subsection{Approach: AR}\label{sec:approach}
Adiabatic replay (AR) prevents an ever-growing number of replayed samples by applying two main strategies: selective replay and selective updating, see \cref{fig:ar1}. Selective replay means that new data are used to query the generator for similar (potentially conflicting) samples. For achieving selective replay, we rely on Gaussian Mixture Models (GMMs) trained by SGD as introduced in \cite{gepperth2021gradient}. GMMs have limited modeling capacity, but are sufficient for the proof-of-concept that we require here\footnote{Deep Convolutional GMMs, see \cite{gepperth2021image}, are a potential remedy for the capacity problem.}.
Adiabatic replay is most efficient when each task $t$, with data $\vec x \sim p^{(t)}$, adds only a small amount of new knowledge. AR models the joint distribution of past tasks as a mixture model with $K$ components, $p^{(1\dots t)}(\vec x) = \sum_k \pi_k \mathcal N(\vec x; \vec \mu_k, \mSigma_k)$; we can therefore formalize this assumption as the requirement that only a few components of $p^{(1\dots t)}$ are activated by new data: $|\{ \argmax_k \left(\pi_k \mathcal N(\vec x_i;\vec \mu_k,\mSigma_k)\right) \,\forall\, \vec x_i \sim p^{(t)}\}| \ll K$.
If this assumption is violated, AR will still work, but more components will need to be updated, requiring more replayed samples.
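For illustration, the following minimal sketch shows the adiabatic check and the selective-replay query; it uses scikit-learn's \texttt{GaussianMixture} and synthetic feature vectors as stand-ins for our SGD-trained GMM and encoded data, so all names and thresholds are placeholders rather than the actual implementation.
\begin{verbatim}
# Sketch: adiabatic check and selective replay with a GMM (stand-in, not our code).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
past_features = rng.normal(size=(2000, 32))          # stand-in for tasks 1..t-1
new_features = rng.normal(loc=3.0, size=(100, 32))   # stand-in for task t

K = 10
gmm = GaussianMixture(n_components=K, covariance_type="full", random_state=0)
gmm.fit(past_features)                                # scholar after tasks 1..t-1

# Adiabatic check: which components are best-matching for the new data?
best_match = gmm.predict(new_features)                # argmax_k pi_k N(x; mu_k, Sigma_k)
active = np.unique(best_match)
is_adiabatic = len(active) < 0.2 * K                  # |{argmax_k ...}| << K

# Selective replay: sample "variants" only from the activated components.
replay = []
for k in active:
    n_k = int(np.sum(best_match == k))                # roughly match the new data volume
    replay.append(rng.multivariate_normal(gmm.means_[k], gmm.covariances_[k], size=n_k))
replay_features = np.concatenate(replay)
\end{verbatim}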
%
% In contrast to DNNs where knowledge is represented in a completely delocalized fashion, GMMs implement a semi-localized knowledge representation.
%
@@ -125,14 +130,14 @@
%
\par\noindent\textbf{Selective replay:} Previous knowledge is not replayed indiscriminately, but only where significant overlap with new data exists.
%
\par\noindent\textbf{Selective updating:} Previous knowledge is only modified by new data where an overlap exists.
%
\par\noindent\textbf{Near-constant time complexity:} Assuming that each task adds only a small fraction to accumulated knowledge (adiabatic assumption), the number of generated/replayed samples can be small as well, and in particular does not grow with the number of tasks.
%
\par\noindent\textbf{Integration of pre-trained feature extractors:} To process visual problems of higher complexity (SVHN, CIFAR), we incorporate recent advances in latent replay into AR: we do not replay raw samples but higher-level representations generated by a frozen feature extractor network.
%-------------------------------------------------------------------------
\subsection{Related Work} %
In recent years, a variety of strategies has been presented to mitigate CF in CL scenarios; please refer to \cite{rev1,rev2,rev3,rev4} for an overview.
Broad strategies include regularization, parameter isolation and rehearsal. In this article, we focus on rehearsal-type CL, and in particular on deep generative replay (DGR).
\par\noindent\textbf{Rehearsal/Replay}
@@ -147,19 +152,19 @@
\cite{van2020brain,pellegrini2020latent,kong2023condensed}. Built on this idea are models like REMIND \cite{hayes2020remind}, which extends latent replay with compression, or SIESTA \cite{harun2023siesta}, which improves computational efficiency by alternating wake and sleep phases in which different parts of the architecture are adapted.
%
\par\noindent\textbf{MIR}
Maximally interfered retrieval (MIR), proposed in \cite{mir}, is an approach to class-incremental CL where a fixed replay budget (either for experience replay or generative replay) is focused on samples that undergo the highest loss increase when training on a new task. For DGR, the authors propose a gradient-based algorithm to specifically generate such samples using a VAE generator. Conceptually, this is similar to our concept of selective replay, although a key difference is that our GMM generator/solver is capable of selective updating as well. We will use MIR as one of the baselines for our experiments.
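For reference, a minimal sketch of the MIR retrieval criterion for experience replay follows; it is a simplification in our own notation, with \texttt{model}, the buffers and all hyper-parameters as placeholders rather than the setup of \cite{mir}.
\begin{verbatim}
# Sketch of the MIR criterion: replay the buffer samples whose loss would
# increase most after a virtual SGD step on the incoming batch.
import copy
import torch
import torch.nn.functional as F

def mir_select(model, new_x, new_y, buffer_x, buffer_y, k=32, lr=0.1):
    with torch.no_grad():                         # loss under the current model
        loss_before = F.cross_entropy(model(buffer_x), buffer_y, reduction="none")
    virtual = copy.deepcopy(model)                # virtual update on new data only
    opt = torch.optim.SGD(virtual.parameters(), lr=lr)
    opt.zero_grad()
    F.cross_entropy(virtual(new_x), new_y).backward()
    opt.step()
    with torch.no_grad():                         # loss after the virtual update
        loss_after = F.cross_entropy(virtual(buffer_x), buffer_y, reduction="none")
    top = torch.topk(loss_after - loss_before, k=min(k, len(buffer_x))).indices
    return buffer_x[top], buffer_y[top]

# Example usage with placeholder data and a linear model:
model = torch.nn.Linear(2048, 10)
bx, by = torch.randn(200, 2048), torch.randint(0, 10, (200,))
nx, ny = torch.randn(32, 2048), torch.randint(0, 10, (32,))
rx, ry = mir_select(model, nx, ny, bx, by, k=16)
\end{verbatim}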
%
%-------------------------------------------------------------------------
\section{Methods}\label{sec:gmm} % Overview of the methods at the start. Refs to the appendix for literature methods such as ER, DGR and foundation models. Explain AR in detail.
\begin{figure}[h!]
\centering
\includegraphics*[width=0.6\linewidth,page=2,viewport=0in 0in 5.6in 2.5in]{figs/figs.pdf}
\caption{The proposed AR approach, illustrated in an exemplary MNIST setting. The scholar (GMM) has been trained on MNIST classes 0, 4 and 6 in task $T_1$. At task $T_2$, new data (class 9) is used to \textit{query} the scholar for similar samples, resulting in the selective replay of mostly 4's but no 0's. The scholar is re-trained \textit{from its current state}, so no data concerning class 0 is required. Re-training results in the insertion of 9's into the existing components. This mechanism works identically for higher-level features produced by a pre-trained feature extractor.}
\label{fig:var}
\end{figure}
%
The main techniques used in the experiments of this article are adiabatic replay (AR), experience replay (ER), deep generative replay (DGR) and pre-trained feature extractors.
We refer to the appendix for details on the experimental settings concerning ER (\cref{app:er}), DGR (\cref{app:dgr}) and the encoding of data by pre-trained models (\cref{app:fm}), whereas we will discuss the details of AR in this section.
%
\subsection{Adiabatic replay (AR)}
% TODO: "as a generator as well as a feature generator for the solver" could be somewhat unclear!
@@ -204,7 +209,7 @@
E-MNIST is intended to represent a CL problem where the amount of already acquired knowledge can be significantly larger than the amount of new data added with each successive task. Therefore, D20-$1^5$ is performed exclusively for E-MNIST.
No feature encoding is performed for MNIST, Fashion-MNIST, E-MNIST and Fruits-360 due to their inherent simplicity. The encoding of SVHN and CIFAR is described in \cref{app:fm}.
%
%-------------------------------------------------------------------------
\subsection{Evaluation measures}\label{sec:exppeval}
@@ -520,33 +525,36 @@
\section{Discussion}
In summary, we can state that our AR approach clearly surpasses VAE-based DGR in the evaluated CIL-P when constraining replay to a constant-time strategy. This is remarkable because the AR scholar performs the tasks of both solver and generator, while at the same time having fewer parameters. The advantage of AR becomes even more pronounced when considering forgetting prevention instead of simply looking at the classification accuracy results.
%
We may therefore conclude that AR offers a principled approach to truly long-term CL. In the following text, we will discuss salient points concerning our evaluation methodology and the conclusions we draw from the results:
% assumptions that go into proportional strategy -> discussion
\par\noindent\textbf{Data}
Some datasets are not considered meaningful benchmarks in non-continual ML due to their simplicity. Still, many CL studies rely on these two datasets, which is why they are included for comparison purposes. SVHN and CIFAR-10 in particular are considered challenging for generative replay methods, see \cite{mir}. E-MNIST represents a simple classification benchmark that is quite hard for CL due to the large number of classes and is well-suited to the targeted AR scenario where each task adds only a small fraction of knowledge. Finally, the Fruits-360 dataset, besides being more complex and higher-dimensional, provides a fairer comparison since it can be solved to equal accuracy by all considered methods in the baseline condition. Any differences are thus intrinsically due to CL performance.
% TODO: SHORTEN!
\par\noindent\textbf{Pre-trained feature extractors}
The use of pre-trained models is appealing in CL, since a lot of complexity can be "outsourced" to these models. As shown in \cite{ostapenko2022continual}, the effectiveness of feature extraction from a frozen pre-trained model relies on the relation between downstream and upstream tasks.
There seems to be excellent agreement for the often-used combination of CIFAR and ImageNet, but this does not extend to, e.g., the SVHN and Fruits datasets without fine-tuning. Thus, we chose separate pre-trained models for each dataset that were optimized in a supervised fashion (SupCon) on similar but not identical data, following \cite{van2020brain}.
% Sensitive to changes in augmentation, upstream/downstream tasks and architectural changes
In contrast, self-supervised contrastive learning alleviates the need for a large labeled pre-training dataset, but relies on large mini-batch sizes as well as complex data augmentation pipelines \cite{dwibedi2021little, chen2020simple,caron2020unsupervised}. We decided against such methods as they only show competitive results when combined with supervised fine-tuning on labeled data \cite{chen2020big}, or when significantly increasing the total number of classes seen in pre-training \cite{gallardo2021self}. % keep
\par\noindent\textbf{Time complexity of standard CL methods}
Regularization-based approaches like EWC have linear time complexity w.r.t. tasks, since each task adds another term to the loss function. The distillation terms in LwF ensure linear time complexity as well. Vanilla experience replay has an implementation-dependent linear time complexity since the number of replayed samples depends on the number of previous tasks. By construction, GEM and A-GEM have linear time complexity since constraints must be computed using retained samples from all previous tasks.
% TODO: all vs. nearly all -> is it really "all"???
Summing up, all currently proposed approaches to CL show linear time complexity, often made worse by linear memory complexity. % keep
\par\noindent\textbf{Issues with constant-time replay}
Many recently proposed methods operate in a constant-time regime, replaying a fixed number of samples for each new task. To balance a constant number $S$ of samples, either drawn from memory or from a generator, against the samples from the current task,
the former are given a higher weight in the loss \cite{mir}. There are several issues with this: First of all, for a large number of tasks, each task will be less and less represented within the $S$ samples, making eventual forgetting inevitable, while the weights for past samples grow higher and higher. Then, giving past samples a higher weight effectively increases the learning rate for these samples, which can break SGD if the weights are too high. Alternatively, the weight for the current samples is \textit{reduced} from its baseline value in some works \cite{vandeven}, ultimately leading to low learning rates and thus long training times. And lastly, the precise weights are generally set post-hoc via cross-validation \cite{mir,mergan}, which is inadmissible for CL because it amounts to knowing all tasks beforehand. AR can use constant-time replay without weighting past samples due to selective updating and selective replay.
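As an illustration (in our own notation; a generic sketch of such weighting schemes rather than the exact formulation of any single work), the weighted loss at task $t$ typically reads
\[
\mathcal L_t = \lambda_{\text{cur}}\,\frac{1}{|\mathcal B_t|}\sum_{\vec x_i \in \mathcal B_t} \ell(\vec x_i) \;+\; \lambda_{\text{past}}\,\frac{1}{S}\sum_{\vec x_j \in \mathcal B_{\text{replay}}} \ell(\vec x_j),
\]
where keeping the influence of each past task constant requires the ratio $\lambda_{\text{past}}/\lambda_{\text{cur}}$ to grow roughly like $t-1$, i.e., linearly with the number of tasks.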
%
\par\noindent\textbf{Violation of AR assumptions}
The assumption that new tasks only add a small contribution is not a hard requirement, just a prerequisite for sample efficiency. Based on the formalization presented in \cref{sec:approach}, its validity is trivial to verify by examining the component activations of the GMM generator when faced with new data. Although we do not implement such a control strategy here, AR would simply need to replay more samples if contributions should be large. However, the chances of this happening in practice are virtually zero if the body of existing knowledge is sufficiently large.
%
%\par\noindent\textbf{Choice of maximally interfered samples}
%For DNN generators such as VAEs, it is not obvious why samples that are "similar" to newly arriving ones should be replayed. However, for the GMM generators used here, things are a little different. First of all, "similarity" is to be understood in terms of GMM posterior probability: two samples are similar if they have a high probability to be generated from the same component. Thus, the best-matching component for a new sample will automatically generate similar samples, a process we term \textit{variant generation}. Due to selective updating, only this component will be adapted to the new sample, leading to maximal potential loss increase for similar samples from previous tasks.
%
%\par\noindent\textbf{Forgetting}
%The use of probabilistic models such as GMMs has another interesting consequence, namely the ability to control forgetting.
%Following \cite{zhou2022fortuitous}, forgetting could be a useful functionality investigated in future work. This is simple to
%control in GMMs by eliminating certain components without affecting the remaining ones.
%
%\par\noindent\textbf{A word on GANs}
%GANs, although in principle capable of generating high-quality samples, are well-known to require considerable data-dependent tuning to perform well and, e.g., avoid mode collapse (see, e.g., \cite{thanh2020catastrophic}). We therefore decided not to rely on GANs for DGR. Our choice of using VAEs for DGR has been supported by other studies, e.g., \cite{lesort2019marginal,mundt2020wholistic,van2020brain}, using VAEs with comparable CL performance to GANs (see also \cite{mir}). Since in CL, all tasks except the first are unknown, we can use data from the first task only to tune GAN structure and hyper-parameters without violating CL conventions. In contrast, VAEs were shown to be robust to problems like mode collapse, which makes them more suitable for CL.
%
\par\noindent\textbf{Initial annealing radius tuning}
AR contains a few technical details that require tuning, like the initial annealing radius parameter $r_0$ when re-training with new task data. We used a single value for all experiments, but performance is sensitive to this choice, since it represents a trade-off between new data acquisition and knowledge retention. Therefore, we intend to develop an automated control strategy for this parameter to facilitate experimentation.
@@ -568,12 +576,11 @@
%
\appendix
\clearpage
\section{Use of pre-trained feature extractors} \label{app:fm}
Encoding features using pre-trained networks, i.e., frozen networks that transform raw data into a higher-level and invariant representation to operate on, has been shown to be beneficial for CL \cite{van2020brain,hayes2020remind,pellegrini2020latent}. A currently promising direction for pre-training such models is \textit{contrastive learning}, which is performed either in a supervised \cite{khosla2020supervised} (SupCon) or a self-supervised fashion \cite{caron2020unsupervised, chen2020simple,dwibedi2021little} (SSCL). In this study, we rely on SupCon to build a robust feature extractor for the more complex datasets (SVHN, CIFAR).
Here, we take a portion of the data from the target domain for pre-training, but exclude these instances from further usage in downstream CL tasks. For SVHN, we pull an amount equal to 0.5 of the total training samples from the \enquote{extra} split. For CIFAR10, we split the training set in half and use one half for pre-training and the other for encoding and later usage in downstream CL.
The data used to pre-train the feature extractor are thus similar but not identical to the subsequent training data, following the approach of \cite{van2020brain}.
An additional data augmentation module normalizes the input and performs random horizontal flipping and rotation in the range of $[-0.02\cdot 2\pi, +0.02\cdot 2\pi]$ for each input image. The encoder backbone is a ResNet-50 with randomly initialized weights and is trained for $256$ epochs using a batch size of $\beta=256$. No further fine-tuning is performed after pre-training. We use the normalized activations of the final pooling layer ($D = 2048$) as the representation vector.
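For clarity, the following minimal sketch shows how such frozen features can be obtained from a ResNet-50 backbone; it is illustrative only, using PyTorch as a stand-in, and the checkpoint path is a hypothetical placeholder rather than part of our pipeline.
\begin{verbatim}
# Sketch: frozen feature extraction with a ResNet-50 backbone (stand-in code).
import torch
import torch.nn as nn
from torchvision.models import resnet50

encoder = resnet50(weights=None)
encoder.fc = nn.Identity()               # keep the 2048-D pooled features
# encoder.load_state_dict(torch.load("supcon_resnet50.pt"))  # hypothetical checkpoint
encoder.eval()                            # frozen: no fine-tuning after pre-training

@torch.no_grad()
def encode(batch):                        # batch: (N, 3, H, W) image tensor
    feats = encoder(batch)                # activations of the final pooling layer
    return nn.functional.normalize(feats, dim=1)  # normalized representation vectors

features = encode(torch.randn(4, 3, 224, 224))    # -> shape (4, 2048)
\end{verbatim}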
@@ -652,7 +659,7 @@
The learning rates for solvers and VAE generators are $\epsilon_S=10^{-4}$ and $\epsilon_G=10^{-3}$, using the ADAM optimizer with $\beta_{1}=0.9$, $\beta_{2}=0.999$. Generator and solver training is performed for 100 and 50 epochs, respectively. We reinitialize both structures before each new task on MNIST, FashionMNIST, E-MNIST and Fruits-360.
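A minimal sketch of these optimizer settings (a PyTorch stand-in with placeholder networks, not the implementation actually used):
\begin{verbatim}
# Sketch of the quoted DGR optimizer settings with placeholder modules.
import torch
import torch.nn as nn

solver = nn.Sequential(nn.Linear(2048, 10))      # placeholder solver head
generator = nn.Sequential(nn.Linear(25, 2048))   # placeholder VAE decoder stub

solver_opt = torch.optim.Adam(solver.parameters(),
                              lr=1e-4, betas=(0.9, 0.999))     # epsilon_S
generator_opt = torch.optim.Adam(generator.parameters(),
                                 lr=1e-3, betas=(0.9, 0.999))  # epsilon_G
# Generator: 100 epochs per task; solver: 50 epochs per task.
\end{verbatim}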
%-------------------------------------------------------------------------
\section{Experience Replay training}\label{app:er}
The solvers for ER are taken to be the same as for VAE-DGR (see the rightmost column in \cref{tab:networkstructure}). The ADAM optimizer is used with a learning rate of $10^{-4}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, and the network is trained for 50 epochs on each task. Analogous to the procedure for DGR, we use replay on latent feature representations, see e.g., \cite{pellegrini2020latent}, encoded by a pre-trained feature extractor as described in \cref{app:fm} for SVHN and CIFAR.
Similar to \cite{riemer2018learning}, reservoir sampling is used to select $50$ samples of each encountered class to be stored. For replay, oversampling of the buffer is performed to obtain a number of samples equal to the number of data instances present in the current task $T_i$.
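The buffer management just described can be sketched as follows; this is illustrative only, with placeholder names, and not the exact implementation used in our experiments.
\begin{verbatim}
# Sketch: per-class reservoir sampling (50 stored samples per class) and
# oversampling of the buffer to match the current task's size.
import random
from collections import defaultdict

CAP = 50                                        # samples stored per class
buffer = defaultdict(list)                      # class label -> stored samples
seen = defaultdict(int)                         # class label -> #samples seen so far

def reservoir_add(x, y):
    """Classic reservoir sampling, applied per class."""
    seen[y] += 1
    if len(buffer[y]) < CAP:
        buffer[y].append(x)
    else:
        j = random.randrange(seen[y])           # uniform over all samples seen
        if j < CAP:
            buffer[y][j] = x

def sample_replay(n_current):
    """Oversample the buffer so replayed data matches the current task size."""
    stored = [x for samples in buffer.values() for x in samples]
    return [random.choice(stored) for _ in range(n_current)]

# Example: fill the buffer from a stream, then draw replay data for a 500-sample task.
for i in range(1000):
    reservoir_add(x=[i], y=i % 10)
replay_data = sample_replay(n_current=500)
\end{verbatim}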