\par\noindent\textbf{Baseline and initial task performance}
We observe superior joint-training (i.e., non-CL) test accuracy $\alpha^{\text{base}}$ for DGR and ER on all datasets except Fruits, where the results are identical; see \cref{tab:short_results} (bottom part). This is especially clear in experiments where the scholar is confronted with \enquote{raw} input data. A possible explanation is that DGR and ER benefit from their internal CNN structure, which is inherently capable of efficiently capturing the distribution of high-dimensional image data and is more robust to variations in the input.
% TODO: why does ER use so few parameters on latent features?
On the other hand, AR in its current state relies on a considerably less complex structure. Furthermore, it should be noted that DGR and ER use significantly more trainable parameters than AR, especially when operating on raw input: for DGR, the parameter ratio relative to AR is 3.7 on RGB data and 4.125 on latent features; for ER, it is 4.7 on RGB data and 0.375 on latent features.
...
...
%
%
\par\noindent\textbf{Constant-time replay is problematic for DGR}
This constraint is especially detrimental to DGR performance, as is apparent from the experiments conducted in this section; see \cref{tab:short_results}. DGR appears to suffer from catastrophic forgetting, to some degree, on all datasets under investigation, and forgetting worsens as the number of tasks increases. This is confirmed by experiments with different task-sequence lengths (D5-$1^5$, D7-$1^5$, D20-$1^5$). To a lesser extent, this is also observed for ER, e.g., on FMNIST and E-MNIST.
%
In addition, we argue that even a scenario that is balanced w.r.t.\ tasks/classes may not protect DGR from significant CF. This is linked to the strong tendency of the VAE to shift its internal representation towards the most recent task. This tendency cannot simply be compensated by increasing the amount of generated data, since the representation of past classes is inherently degraded over time by replay artifacts. Moreover, doing so would significantly increase the (re-)training time of the scholar with each task addition.
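The runtime trade-off discussed above can be made concrete with a back-of-the-envelope sketch (all sample counts are hypothetical, chosen only for illustration; this is not a measurement of any actual implementation): fully replaying every past class makes per-task generation cost grow linearly with the task index, while a constant-time budget instead shrinks the per-class share of generated samples.

```python
# Toy illustration (hypothetical numbers, assuming one new class per task):
# if DGR compensates for representation shift by fully replaying every past
# class, the number of generated samples per task grows linearly; under a
# constant-time budget, the per-class share of generated samples shrinks.

SAMPLES_PER_CLASS = 1000   # generated samples needed per past class (assumed)
BUDGET = 1000              # fixed total replay budget per task (assumed)

def full_replay_cost(task):
    """Generated samples for task t when all past classes are fully replayed."""
    return (task - 1) * SAMPLES_PER_CLASS

def constant_budget_share(task):
    """Generated samples available per past class under a fixed budget."""
    return BUDGET // (task - 1) if task > 1 else 0

costs = [full_replay_cost(t) for t in range(1, 6)]
shares = [constant_budget_share(t) for t in range(1, 6)]
print(costs)   # grows linearly with the task index
print(shares)  # per-class share shrinks towards zero
```

Either the scholar's (re-)training time grows with every task, or, at constant time, past classes receive ever fewer generated samples; both failure modes match the forgetting behaviour reported above.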
...
...
% Briefly summarize the latent replay results; what can be observed
% CF for DGR: performance degrades drastically
% AR only suffers significant forgetting on CIFAR-10; T1 (classes 4-9) degrades strongly after T5 (class 2)
\par\noindent\textbf{Latent replay/latent AR}
For latent replay (SVHN and CIFAR), the results (\cref{tab:short_results}, upper part) show that DGR universally suffers from catastrophic forgetting despite having the same baseline performance $\alpha^{\text{base}}$ as latent ER and AR. Forgetting for AR appears significant only for CIFAR D5-$1^5$B after task $T_5$, due to a high overlap with classes from the initial task $T_1$.
% Argument: ER only performs poorly because the budget is too small
Moreover, it is surprising that latent AR generally achieves better results than latent ER. One could argue that the per-class budget for complex datasets such as SVHN and CIFAR-10 is rather small, and it can be assumed that increasing the budget would improve CL performance. However, we stress again that this is not trivially applicable in scenarios with a constrained memory budget.
%
\par\noindent\textbf{CF and selective replay}
AR shows promising results in terms of knowledge retention, or prevention of forgetting, for sequentially learned classes, as reflected by generally lower average forgetting.
In virtually all of the experiments conducted, we observed only a very moderate loss of information about the first task $T_1$ after full training, suggesting that AR's ability to handle small incremental additions to its internal knowledge base over a sequence of tasks is an intrinsic property, owed to the selective replay mechanism.
Moreover, AR demonstrates an intrinsic ability to limit unnecessary overwriting of past knowledge by performing efficient \textit{selective updates}, instead of having to replay the entire accumulated knowledge each time.
%
\par\noindent\textbf{Selective updates}
Selective updates, as performed during AR training, are mainly characterized by matching GMM components to arriving inputs. Performance on previous tasks therefore generally decreases only slightly through the adaptation of selected (similar) units, as shown by the low forgetting rates for almost all CIL problems studied in \cref{tab:short_results}. This implies that the GMM tends to converge towards a \textit{trade-off} between past knowledge and new data. The effect is most notable when two classes with high similarity in input space are successively (replay-)trained, as in F-MNIST D5-$1^5$A, where task $T_2$ (class \enquote{sandals}) and task $T_4$ (class \enquote{sneakers}) compete for internal capacity.
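The component-matching mechanism can be sketched in a few lines (a minimal NumPy illustration, not the actual AR implementation: the diagonal-covariance toy GMM, the hard winner-takes-all assignment, and all names are simplifying assumptions made here for clarity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy diagonal-covariance GMM: K components in D dimensions, standing in
# for the scholar's already-trained mixture model.
K, D = 4, 2
mu = np.array([[0.0, 0.0], [0.0, 3.0], [3.0, 0.0], [3.0, 3.0]])
sigma = np.ones((K, D))          # diagonal std deviations
weights = np.full(K, 1.0 / K)    # mixture weights
mu_before = mu.copy()

def responsibilities(x):
    """Posterior p(k | x) of each component for a single sample x."""
    log_p = (np.log(weights)
             - 0.5 * np.sum(np.log(2.0 * np.pi * sigma ** 2), axis=1)
             - 0.5 * np.sum(((x - mu) / sigma) ** 2, axis=1))
    log_p -= log_p.max()         # numerical stability
    p = np.exp(log_p)
    return p / p.sum()

def selective_update(x, lr=0.1):
    """Adapt only the best-matching component towards the new sample;
    all other components (past knowledge) remain untouched."""
    k = int(np.argmax(responsibilities(x)))
    mu[k] += lr * (x - mu[k])
    return k

# Data from a "new task", concentrated in one region of input space,
# mainly adapts the single component that already matches it.
for _ in range(100):
    selective_update(rng.normal(loc=3.0, scale=0.1, size=D))
```

In this sketch, only the component nearest the new data moves, while the remaining components are left as they were, which mirrors the trade-off between past knowledge and new data described above; two highly similar classes would compete for the same component, as in the F-MNIST example.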
% ---------------------
\section{Discussion}
...
...
%
Based on this proof-of-concept, we may therefore conclude that AR offers a principled approach to truly long-term CL. In the following, we discuss salient points concerning our evaluation methodology and the conclusions we draw from the results:
% assumptions that go into proportional strategy -> discussion
\par\noindent\textbf{Data}
Some of the chosen datasets are not considered meaningful benchmarks in non-continual ML due to their simplicity; still, many CL studies rely on them, which is why they are included for comparison purposes. SVHN and CIFAR-10 in particular are considered challenging for generative replay methods, see \cite{mir}. E-MNIST represents a simple classification benchmark that is nevertheless quite hard in a CL setting due to its large number of classes, and it is well-suited to the targeted AR scenario in which each task adds only a small fraction of knowledge. Finally, the Fruits-360 dataset, besides being more complex and higher-dimensional, provides a fairer comparison, since all considered methods solve it to equal accuracy in the baseline condition; any differences are thus intrinsically due to CL performance.
\par\noindent\textbf{Foundation models}
The use of foundation models is appealing in CL, since a lot of complexity can be \enquote{outsourced} to these models. As shown in \cite{ostapenko2022continual}, the effectiveness of feature extraction from a frozen pre-trained model depends on the relation between downstream and upstream tasks.
...
...
AR contains a few technical details that require tuning, such as the initial annealing radius $r_0$ used when re-training with new task data. We used a single value for all experiments, but performance is sensitive to this choice, since it represents a trade-off between acquiring new data and retaining existing knowledge. We therefore intend to develop an automated control strategy for this parameter to facilitate experimentation.
%
\section{Conclusion}
% TODO: possibly build on the line of argument in \cite{harun2023siesta}?
% Would be a nice way to close by re-emphasizing the main argument of constant memory/runtime complexity...
% 1) Computational efficiency in comparison to re-training the whole structure
% 2) Operate in a compute/memory constrained environment
%
We firmly believe that continual learning (CL) holds the potential to spark a new machine learning revolution since, if it can be made to work in practice, it allows models to be trained over very long periods of time.
% TODO: Check if OK; Fokus auf praktische Umsetzung hinsichtlich Speicher & Laufzeit
To achieve this important milestone, CL research should not neglect the development of strategies that aim for computational efficiency, and should therefore also focus on operability in memory-constrained environments. For this reason, emphasis should also be placed on presenting adequate evaluation scenarios and meaningful benchmarks.
%
In this study, we present a proof-of-concept for CL that operates at a time complexity independent of the amount of previously acquired knowledge, a property we also observe in humans. Clearly, more complex problems must be investigated under less restricted conditions, and a comprehensive comparison to other CL methods (EWC, LwF, GEM) will be provided in future work. Above all, we believe that more complex probabilistic models, such as deep convolutional GMMs, must be employed to adequately describe more complex data distributions.
%