% discussion: reinit or train on for DGR, replay in general.
% shorten discussion
% adjust references to the balanced setting if necessary
% conclusion
\documentclass{article}% For LaTeX2e
...
...
\label{tab:slts}
}
\end{table}
We set the training mini-batch size to $\beta=100$ ($\beta=50$ for the Fruits dataset). For AR, selective replay of $D_i$ samples is performed before training on task $T_{i}, i>1$, using the current scholar $S_{i-1}$, where $D_i$ denotes the number of training samples contained in $T_i$. For DGR, replay of $D_i$ samples is likewise performed before training on task $T_i$. This replay strategy keeps the number of generated samples \textit{constant w.r.t.\ the number of tasks}, and thus comes with modest temporary storage requirements instead of growing linearly with the number of incoming tasks.
When replaying, mini-batches of $\beta$ samples are randomly drawn, in equal proportions, from the real samples from task $T_i$ and the generated/retained samples representing previous tasks.
%For training, mini-batches are randomly drawn from this resulting merged subset $\mathcal{D}_{T_i}$.
It is worth noting that classes will, in general, \textit{not} be balanced in the merged generated/real data at $T_i$, and that it is not required to store the statistics of previously encountered class instances/labels.
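%
For illustration, the following minimal NumPy sketch shows how such a replay mini-batch of size $\beta$ could be assembled in equal proportions from current-task data and generated/retained data. Array names such as \texttt{real\_x} and \texttt{gen\_x} are placeholders chosen for this sketch and not identifiers from our implementation.
\begin{verbatim}
import numpy as np

def make_replay_minibatch(real_x, real_y, gen_x, gen_y, beta=100, rng=None):
    """Draw beta samples: half from the current task, half generated/retained."""
    rng = rng or np.random.default_rng()
    n_real = beta // 2
    n_gen = beta - n_real
    idx_r = rng.choice(len(real_x), size=n_real, replace=False)
    idx_g = rng.choice(len(gen_x), size=n_gen, replace=False)
    x = np.concatenate([real_x[idx_r], gen_x[idx_g]])
    y = np.concatenate([real_y[idx_r], gen_y[idx_g]])
    perm = rng.permutation(beta)  # interleave both halves
    return x[perm], y[perm]

# Before training on task T_i (i > 1), D_i samples are generated once with the
# previous scholar, so temporary storage stays constant w.r.t. the task count:
# gen_x, gen_y = scholar_prev.sample(D_i)   # hypothetical scholar interface
\end{verbatim}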
First, we demonstrate the ability of a trained GMM to query its internal representation through data samples and to selectively generate artificial data that \enquote{best match} the samples defining the query. To illustrate this, we train a GMM layer of $K=25$ components on MNIST classes 0, 4 and 6 for 50 epochs, using the best-practice rules described in \cref{app:ar}. Then, we query the trained GMM exclusively with samples from class 9, as described in \cref{sec:gmm}. The resulting samples are all from class 4, since this is the class \enquote{most similar} to the query class. These results are visualized in \cref{fig:var}. Variant generation results for deep convolutional extensions of GMMs can be found in \cite{gepperth2021new}, emphasizing that the AR approach can be scaled to more complex problems.
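%
The querying mechanism can be sketched as follows for a GMM with diagonal covariances: each query sample selects its best-matching component via the (log-)responsibilities, and a variant is then drawn from that component. This is an illustrative simplification of the sampling procedure described in \cref{sec:gmm}; all identifiers are assumptions made for this sketch.
\begin{verbatim}
import numpy as np

def generate_variants(query_x, pis, mus, sigmas, rng=None):
    """For each query sample, pick the best-matching GMM component
    (diagonal covariance) and draw one variant from it."""
    rng = rng or np.random.default_rng()
    variants = []
    for x in query_x:
        # log p(k | x) up to an additive constant
        log_resp = (np.log(pis)
                    - 0.5 * np.sum(((x - mus) / sigmas) ** 2, axis=1)
                    - np.sum(np.log(sigmas), axis=1))
        k = np.argmax(log_resp)                  # best-matching component
        variants.append(rng.normal(mus[k], sigmas[k]))
    return np.stack(variants)
\end{verbatim}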
In this main experiment, we evaluate the CL performance of AR w.r.t.\ the measures given in \cref{sec:exppeval}, and compare it to DGR and ER, since these represent principled approaches to replay-based CL. Results are tabulated in \cref{tab:short_results}.
%
\par\noindent\textbf{Baseline and initial task performance:}
We observe superior joint training (i.e., non-CL) test accuracy $\alpha^{base}$ for DGR and ER on all datasets except Fruits, where the results are identical, see \cref{tab:short_results} (bottom part). This is especially clear for experiments where the scholar is confronted with \enquote{raw} input data. A possible explanation is that DGR and ER benefit from their internal CNN structure, which is inherently capable of efficiently capturing the distribution of high-dimensional image data and of exploiting its invariances.
% TODO: why does ER use so few parameters on latent features?
On the other hand, AR relies on a considerably less complex structure in its current state. Furthermore, it should be noted that DGR and ER generally use more trainable parameters than AR, especially when operating on raw input: the ratio relative to AR is 3.7 (DGR, RGB data) and 4.125 (DGR, latent features), as well as 4.7 (ER, RGB data) and 0.375 (ER, latent features).
%
The ability to perform well in joint-class training may also directly translate to a better starting point for CL with DGR and ER due to the initial task $T_1$ being constituted from a large body of classes in this experimental evaluation.
%
For this reason we find the Fruits-360 dataset to be a valuable benchmark, since it is high-dimensional yet simple enough to be solved to high accuracy by AR in the baseline condition. Therefore, comparisons of CL performance are not biased by any initial difference in classification accuracy.
%
For SVHN and CIFAR, we observe a similar situation with only minor differences to Fruits-360, as the encoded feature representations inherently have a higher degree of linear separability.
%
% PARAMs:
%
...
...
%
%
% From here on, some criticism of DGR
\par\noindent\textbf{Constant-time replay is problematic for DGR:}
This constraint is especially detrimental for DGR performance, as is apparent from the experiments conducted in this section, see \cref{tab:short_results}. It appears that, to some degree, DGR suffers from catastrophic forgetting for all datasets under investigation. However, forgetting worsens as the number of tasks increases, which is confirmed by experiments with different task sequence lengths (D5-$1^5$, D7-$1^5$, D20-$1^5$). To a lesser extent, this is also observed for ER on, e.g., FMNIST and E-MNIST.
%
In addition, we argue that even when a balanced scenario w.r.t.\ tasks/classes is ensured, this may not protect DGR from significant CF. This is linked to the strong tendency of the VAE to shift its internal representation towards the most recent task. This tendency cannot simply be compensated by increasing the amount of generated data, since the representation of past classes is inherently degraded over time by replay artifacts. Moreover, doing so would significantly increase the (re-)training time of the scholar with each task addition.
%
In contrast, AR is specifically designed to work well when the number of generated samples is kept constant for each task, regardless of the ever-increasing number of tasks. \Cref{fig:gen_samples_loglik_plot} shows the development of generated sample counts over time for AR and DGR, respectively.
%
\par\noindent\textbf{ER vs.\ AR:}
Generally, ER shows good results on all datasets and often outperforms AR when operating on raw inputs (MNIST, FMNIST, Fruits-360 and E-MNIST), although the differences are not striking, and the performance of DGR is still significantly inferior. Besides the strong differences in model complexity, a comparison between ER and AR is biased in favor of ER, since AR does not get to see any real samples from past tasks. Rather, ER serves as a baseline of what can reasonably be expected from AR, and we observe that this baseline is generally matched quite well.
%
On the other hand, ER has the disadvantage that training time and memory usage grow slowly but linearly with each added task, which is an unrealistic premise in practice. A fixed memory budget mitigates this problem, but has the negative effect that samples from long-ago sub-tasks will be lost over time, which will render ER ineffective if the number of tasks is large.
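%
To make this trade-off concrete, a class-balanced buffer with a hard global budget could be organized as sketched below. This is an illustrative sketch, not the ER implementation evaluated here; once the budget is exhausted, admitting new classes necessarily evicts stored samples of older ones.
\begin{verbatim}
import numpy as np

class FixedBudgetBuffer:
    """Class-balanced episodic memory with a hard global budget."""
    def __init__(self, budget):
        self.budget = budget
        self.store = {}                          # class label -> list of samples

    def add_task(self, xs, ys):
        for y in ys:
            self.store.setdefault(int(y), [])
        quota = self.budget // len(self.store)   # per-class quota shrinks over time
        for c in self.store:                     # samples beyond the quota are lost
            self.store[c] = self.store[c][:quota]
        for x, y in zip(xs, ys):
            if len(self.store[int(y)]) < quota:
                self.store[int(y)].append(x)
\end{verbatim}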
%
% briefly summarize the latent replay results and what can be observed
% CF for DGR: performance degrades drastically...
% AR only suffers significant forgetting on CIFAR-10; T1 (classes 4-9) degrades strongly after T5 (class 2)
\par\noindent\textbf{Latent replay/latent AR: }
For latent replay (SVHN and CIFAR), the results (see \cref{tab:short_results}, upper part) show that DGR universally suffers from catastrophic forgetting, despite having the same baseline performance $\alpha^{\text{base}}$ as latent ER and AR. Forgetting for AR only appears to be significant for CIFAR D5-$1^5$B after task $T_5$, due to a high overlap with classes from the initial task $T_1$.
% Argument: ER only performs poorly because the budget is too small
Moreover, it is surprising to see that latent AR achieves generally better results than latent ER. It could be argued that the budget per class for complex datasets like SVHN and CIFAR-10 is rather small, and that increasing the budget would improve CL performance. However, we stress again that this is not trivially applicable in scenarios with a constrained memory budget.
%
\par\noindent\textbf{CF and selective replay:}
AR shows promising results in terms of knowledge retention, or prevention of forgetting, for sequentially learned classes, as reflected by generally lower average forgetting. We observed very little loss of knowledge on the first task $T_1$ after full training, suggesting that AR's ability to handle small incremental additions/updates to the internal knowledge base over a sequence of tasks is an intrinsic property, due to the selective replay mechanism.
Moreover, AR demonstrates its intrinsic ability to limit unnecessary overwrites of past knowledge by performing efficient \textit{selective updates}, instead of having to replay the entire accumulated knowledge each time.
%
\par\noindent\textbf{Selective updates:}
Selective updates, as performed by AR training, are mainly characterized by matching GMM components to arriving inputs. Therefore, performance on previous tasks decreases only moderately through the adaptation of selected/similar units, as shown by low forgetting rates on almost every investigated CIL problem in \cref{tab:short_results}. This implies that the GMM tends to converge towards a \textit{trade-off} between past knowledge and new data. This effect is observed in successive (replay-)training for two classes sharing a high similarity in input space, as seen, e.g., for FMNIST D5-$1^5$A, where task $T_2$ (class \enquote{sandals}) and task $T_4$ (class \enquote{sneakers}) compete for internal capacity.
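%
The intuition behind such selective updates can be summarized by the following simplified sketch, in which only the best-matching component is adapted towards an arriving sample. This is a hard-assignment approximation of the actual SGD-based GMM training described in \cref{app:ar}; all identifiers are assumptions made for illustration.
\begin{verbatim}
import numpy as np

def selective_gmm_step(x, pis, mus, sigmas, lr=0.01):
    """Adapt only the component that best matches sample x.
    Components representing dissimilar past data stay essentially untouched."""
    log_resp = (np.log(pis)
                - 0.5 * np.sum(((x - mus) / sigmas) ** 2, axis=1)
                - np.sum(np.log(sigmas), axis=1))
    k = np.argmax(log_resp)          # best-matching component
    mus[k] += lr * (x - mus[k])      # move only this centroid towards x
    return k
\end{verbatim}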
%
% -------------------------
% -------------------------
\begin{table}
\scriptsize
%\footnotesize
...
...
&&&&&&&\\
\end{tabular}
\end{center}
\caption{Main experimental results. The first and second table display the results of all investigated methods (AR, DGR and ER) for each class-incremental learning (CIL) problem, giving final test-set accuracies $\alpha_T$ and average forgetting measures $F_T$.
The relevant baselines $\alpha^\text{base}$ (joint-training) are showcased in the bottom table.
All results are averaged across $N=10$ runs.
Detailed information about the evaluation and experimental setup can be found in \cref{sec:exppeval}.
\label{tab:short_results}}
\end{table}
% ---------------------
\begin{figure}[h!]
\centering
\begin{subfigure}{.4\textwidth}
...
...
@@ -506,7 +512,7 @@
%\caption{A subfigure}
%\label{fig:sub2}
\end{subfigure}%
\caption{Number of generated samples per task for E-MNIST D20-1 (AR/VAE in a balanced scenario).}
%Right: Successive tasks have no significant overlap. This is shown using the negative GMM log-likelihood for AR after training on task $T_i$ and then keeping the GMM fixed. As we can observe, log-likelihood universally drops, indicating a poor match.