Commit e613e8e9 authored by ak

Merge branch 'master' of gitlab.cs.hs-fulda.de:fdai0114/iclr24-ar-foundation

parents 5ebc60d0 63868fac
@@ -993,9 +993,9 @@
%
% From here on ... bash DGR a bit h3h3
\par\noindent\textbf{Constant-time replay is problematic for DGR}
We observe that DGR, regardless of which generator is used, performs poorly, see \cref{tab:short_results}. It appears that DGR suffers from catastrophic forgetting for all datasets under investigation. Moreover, forgetting worsens as the number of tasks increases. This is confirmed by experiments with different task sequence lengths (D5-$1^5$, D7-$1^5$, D20-$1^5$). To a lesser extent, this is also observed for ER on, e.g., FMNIST and E-MNIST.
% GOOD but too long
%In addition, we argue that even when a balanced scenario w.r.t.\ tasks/classes is ensured, it may not protect DGR from significant CF. This is linked to the strong tendency of the VAE to shift its internal representation towards the most recent task. This tendency may not be compensated by increasing the amount of generated data, as the representation of past classes is inherently degraded over time by replay artifacts. Moreover, this would significantly increase the (re-)training time of the scholar with each task addition.
%
In contrast, AR is specifically designed to work well when the amount of generated samples is kept constant for each task in an ever-increasing number of tasks. \Cref{fig:gen_samples_loglik_plot} shows the development of generated sample counts over time for AR and DGR-VAE, respectively, in a balanced scenario.
%
@@ -1019,6 +1019,8 @@
% Briefly summarize the LR results; what can be observed
% CF for DGR, it tanks completely...
% AR only suffers from significant forgetting on CIFAR-10; T1 (classes 4-9) drops sharply after T5 (class 2)
\par\noindent\textbf{AR vs.\ MIR}
MIR and AR share the concept of selective replay, and both operate in a constant-time scenario, although MIR has to weight generated and new samples differently in the loss. In general, we observe similar performance, although we must stress that MIR is highly sensitive to hyper-parameters such as the weights of the different loss terms, which must be set by cross-validation and are thus, strictly speaking, not compatible with CL.
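Schematically, MIR retrieves for replay those generated (or stored) samples $(x,y)$ whose loss would increase most under a virtual parameter update $\theta \rightarrow \theta'$ obtained from one SGD step on the current batch, i.e., the samples maximizing the interference score
\[
  s(x,y) \;=\; \ell\big(f_{\theta'}(x),y\big) \;-\; \ell\big(f_{\theta}(x),y\big),
\]
where $f_{\theta}$ denotes the classifier and $\ell$ its loss (the notation here is generic; see \cite{mir} for the precise formulation).
%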
\par\noindent\textbf{Latent replay/latent AR}
For latent replay (SVHN, CIFAR), the results in the upper part of \cref{tab:short_results} show that DGR universally suffers from catastrophic forgetting, despite having the same baseline performance $\alpha^{\text{base}}$ as latent ER and AR. Forgetting for AR only seems significant for CIFAR: D5-$1^5$B after task $T_5$, due to a high overlap with classes from the initial task $T_1$.
% Argument: ER only bad because the budget is too small
@@ -1047,8 +1049,9 @@
In contrast, self-supervised contrastive learning alleviates the need for a large labeled pre-training dataset, but relies on large mini-batch sizes as well as complex data augmentation pipelines \cite{dwibedi2021little, chen2020simple,caron2020unsupervised}. We decided against such methods as they only show competitive results when combined with supervised fine-tuning on labeled data \cite{chen2020big}, or when the total number of classes seen during pre-training is significantly increased \cite{gallardo2021self}. % keep
\par\noindent\textbf{Time complexity of default CL methods}
Regularization-based approaches like EWC have linear time complexity w.r.t.\ tasks, since each task adds another term to the loss function. The distillation terms in LwF ensure linear time complexity as well. Vanilla experience replay has an implementation-dependent linear time complexity since the number of replayed samples depends on the number of previous tasks. By construction, GEM and A-GEM have linear time complexity since constraints must be computed using retained samples from all previous tasks.
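As a simple illustration for the EWC case (generic notation: $F_i^{(k)}$ denotes the diagonal Fisher information and $\theta^{(k),*}$ the parameters obtained after task $T_k$), the loss when training on task $T_t$ contains one quadratic penalty per preceding task,
\[
  \mathcal{L}(\theta) \;=\; \mathcal{L}_{T_t}(\theta) \;+\; \sum_{k<t} \frac{\lambda}{2} \sum_i F_i^{(k)} \big(\theta_i - \theta_i^{(k),*}\big)^2,
\]
so both the number of penalty terms and the amount of stored statistics grow linearly with the number of completed tasks.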
%
\par\noindent\textbf{Issues with constant-time replay}
Instead of achieving balance between new and recalled/generated samples by a linear increase of the latter, many recently proposed replay approaches use only a fixed number $S$ of generated or recalled samples per task. Balance is instead realized by a higher weight of past samples in the loss \cite{mir}. There are several issues with this: First, for a large number of tasks, each task will be less and less represented in the $S$ samples, making eventual forgetting inevitable, while the weights for past samples grow ever higher. Second, giving past samples a higher weight effectively increases the learning rate for these samples, which can break SGD if the weights are too high. Alternatively, the weight for the current samples is \textit{reduced} from its baseline value in some works \cite{van2020brain}, ultimately leading to low learning rates and thus long training times. Lastly, the precise weights are generally set post-hoc via cross-validation \cite{mir,wu2018memory}, which is inadmissible for CL because it amounts to knowing all tasks beforehand. AR can use constant-time replay without weighting past samples due to selective updating and selective replay. We also verified that AR, when used in a balanced scenario that linearly increases the number of generated samples, shows no meaningful performance differences compared to the constant-time case.
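To make the first point concrete with a back-of-the-envelope estimate (quantities chosen purely for illustration, $\beta_t$ denoting the loss weight given to past samples at task $t$): with a fixed replay budget of $S$ samples and $t-1$ previous tasks,
\[
  \underbrace{\frac{S}{t-1}}_{\text{samples per past task}} \;\longrightarrow\; 0
  \qquad\text{while}\qquad
  \beta_t \;\propto\; t-1 \;\longrightarrow\; \infty \quad (t\to\infty),
\]
e.g., for $S=100$ and $t=101$, each past task is represented by a single sample, whose weight must be roughly two orders of magnitude higher than that of a current-task sample to keep contributions balanced.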
%
\par\noindent\textbf{Violation of AR assumptions}
The assumption that new tasks only add a small contribution is not a hard requirement, just a prerequisite for sample efficiency. Based on the formalization presented in \cref{sec:intro}, its validity is trivial to verify by examining the component activations of the GMM generator when faced with new data. Although we do not implement such a control strategy here, AR would simply need to replay more samples if a new task's contribution turns out to be large. However, the chances of this happening in practice are virtually zero if the body of existing knowledge is sufficiently large.
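A minimal sketch of such a check (generic GMM notation, not part of our implementation): for each incoming sample $x$ of a new task, evaluate the component responsibilities and the mixture log-likelihood of the trained GMM generator,
\[
  \gamma_k(x) \;=\; \frac{\pi_k\,\mathcal{N}(x;\mu_k,\Sigma_k)}{\sum_j \pi_j\,\mathcal{N}(x;\mu_j,\Sigma_j)},
  \qquad
  \log p(x) \;=\; \log \sum_k \pi_k\,\mathcal{N}(x;\mu_k,\Sigma_k).
\]
If $\log p(x)$ is low for a large fraction of the new samples, i.e., the new data is poorly covered by existing components, the new task's contribution is large and AR would have to replay more samples while adapting.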
...