diff --git a/iclr2024_conference.pdf b/iclr2024_conference.pdf
index dc5c0d5b6e291b32a4194bad2cea75f10a2970e2..a7a0e174205b8f2ebf79941dd3284d6a618f9350 100644
Binary files a/iclr2024_conference.pdf and b/iclr2024_conference.pdf differ
diff --git a/iclr2024_conference.tex b/iclr2024_conference.tex
index c31f7572ad5cf618686f586249dd40e629f5b03f..342a8c5d67ea541e8272f9ad3264a905d0206e23 100644
--- a/iclr2024_conference.tex
+++ b/iclr2024_conference.tex
@@ -118,7 +118,7 @@
 \subsection{Approach: AR}\label{sec:approach}
 Adiabatic replay (AR) prevents an ever-growing number of replayed samples by applying two main strategies: selective replay and selective updating, see \cref{fig:ar1}. Selective replay means that new data are used to query the generator for similar (potentially conflicting) samples. For achieving selective replay, we rely on Gaussian Mixture Models (GMMs) trained by SGD as introduced in \cite{gepperth2021gradient}. GMMs have limited modeling capacity, but are sufficient for the proof-of-concept that we require here \footnote{Deep Convolutional GMMs, see \cite{gepperth2021image} are a potential remedy for the capacity problem.}.
- Adiabatic replay is most efficient when each task $t$, with data $\vec x \sim p^{(t)}$, adds only a small amount of new knowledge. AR models the joint distribution of past tasks as a mixture model with $K$ components, $p^{(1\dots t)}(\vec x) = \sum_k \pi_k \mathcal N(\vec x; \vec \mu_k, \mSigma_k)$, we can formalize this assumption as a requirement that only a few components in $p^{(1\dots t)}$ are activated by new data: $|\{ \argmax_k \left(\pi_k \mathcal N(\vec x_i;\vec \mu_k,\mSigma_k)\right)\forall \vec x_i \sim p^{(t)}\}| << K$.
+ Adiabatic replay is most efficient when each task $t$, with data $\vec x \sim p^{(t)}$, adds only a small amount of new knowledge. AR models the joint distribution of past tasks as a mixture model with $K$ components, $p^{(1\dots t)}(\vec x) = \sum_k \pi_k \mathcal N(\vec x; \vec \mu_k, \mSigma_k)$; we can thus formalize this assumption as the requirement that only a few components of $p^{(1\dots t)}$ are activated by new data: $|\{ \argmax_k \pi_k \mathcal N(\vec x_i;\vec \mu_k,\mSigma_k) \;\forall \vec x_i \sim p^{(t)}\}| \ll K$.
 If this assumption is violated, AR will still work but more components will need to be updated, requiring more samples.
 %
 % In contrast to DNNs where knowledge is represented in a completely delocalized fashion, GMMs implement a semi-localized knowledge representation.
@@ -146,7 +146,9 @@
 \par\noindent\textbf{Deep Generative Replay} Here, (deep) generative models like GANs \cite{goodfellow2014generative} and VAEs \cite{kingma2013auto} are used for memory consolidation by replaying samples from previous tasks, see \cref{fig:genrep} and \cite{shin2017continual}.
 %
- The recent growing interest in GR brought up a variety of architectures, either being VAE-based \cite{kamra2017deep, lavda2018continual, ramapuram2020lifelong, ye2020learning, caselles2021s} or GAN-based \cite{wu2018memory, ostapenko2019learning, wang2021ordisco, atkinson2021pseudo}.
+ The recent growth of interest in GR has produced a variety of architectures, either VAE-based \cite{kamra2017deep, lavda2018continual, ramapuram2020lifelong, ye2020learning, caselles2021s} or GAN-based \cite{wu2018memory, ostapenko2019learning, wang2021ordisco, atkinson2021pseudo}.
+ Notably, the MerGAN model \cite{wu2018memory} uses an LwF-type knowledge distillation technique to prevent forgetting in generators, which is more efficient than pure replay.
+ Furthermore, PASS \cite{zhu2021} uses self-supervised learning by sample augmentation in conjunction with slim class-based prototype storage to improve the performance of replay-based CL.
 %
 An increasingly employed technique in this respect is \textit{latent replay} which operates on and replays latent features generated by a frozen encoder network, see, e.g., \cite{van2020brain,pellegrini2020latent,kong2023condensed}.
 Built on this idea are models like REMIND \cite{hayes2020remind}, which extends latent replay by the aspect of compression, or SIESTA \cite{harun2023siesta} which improves computational efficiency by alternating wake and sleep phases in which different parts of the architecture are adapted.
@@ -222,7 +224,7 @@
 Average forgetting is then defined as: $F_{T} = \frac{1}{t-1} \sum^{t-1}_{j=1} F^{t}_{j} \qquad F_t \in [0,1]$.
 %
 %-------------------------------------------------------------------------
- \subsection{Experimental Setting}
+ \subsection{Experimental Setting}\label{sec:expset}
 % Machine description
 All experiments are run on a cluster of 30 machines equipped with single RTX3070Ti GPUs.
 % General experimental setup -> ML domain
@@ -235,41 +237,44 @@
 Training consists of an (initial) run on $T_1$, followed by a sequence of independent (replay) runs on $T_{i}, i>1$.
 % Averaged over runs & baseline experiments
 We perform ten randomly initialized runs for each CIL-Problem, and conduct baseline experiments for all datasets to measure the offline joint-class training performance. We set the training mini-batch size to $\beta=100$ ($\beta=50$ for the Fruits dataset).
- %
- \begin{table}[h!]
- \scriptsize
- \renewcommand{\arraystretch}{.9}
- \centering
- \begin{tabular}{ c:c: c | c | c | c | c | c }
- & task split & {$T_1$} & {$T_2$} & {$T_3$} & {$T_4$} & {$T_5$} & {$T_6$} \\[0.2ex]
- \cdashline{1-8}
- & & & & & & \\
- \multirow[c]{3}{*}[0in]{\rotatebox{90}{CIL-P.}} & \multicolumn{1}{:c:}{D5-$1^5$A} & [0-4] & 5 & 6 & 7 & 8 & 9 \\
- %\hline
- % D6-$1^4$A & [0-5] & 6 & 7 & 8 & 9 & / \\
- % D6-$1^4$B & [4-9] & 0 & 1 & 2 & 3 & / \\
- % \hline
- & \multicolumn{1}{:c:}{D7-$1^3$A} & [0-6] & 7 & 8 & 9 & / & / \\
- %\hline
- & \multicolumn{1}{:c:}{D20-$1^5$A} & [0-19] & 20 & 21 & 22 & 23 & 24 \\
- \end{tabular}
- \hspace{0.25cm}
- \begin{tabular}{ c: c | c | c | c | c | c }
- task split & {$T_1$} & {$T_2$} & {$T_3$} & {$T_4$} & {$T_5$} & {$T_6$} \\[0.2ex]
- \cdashline{1-7}
- & & & & & & \\
- D5-$1^5$B & [5-9] & 0 & 1 & 2 & 3 & 4 \\
- %\hline
- D7-$1^3$B & [3-9] & 0 & 1 & 2 & / & / \\
- %\hline
- D20-$1^5$B & [5-24] & 0 & 1 & 2 & 3 & 4 \\
- \end{tabular}
- \caption{Showcase of the investigated CL/CIL-Problems. Each task $T_i$ contains image and label data $(X,Y)$ from each of the corresponding classes.
- D20-$1^5$A and D20-$1^5$B are exclusive for E-MNIST.
+
+ From the benchmarks of \cref{sec:data}, we construct the following CIL problems by splitting each dataset into successive groups of classes: D5-$1^5$A (6 tasks: 0-4, 5, 6, 7, 8, 9), D5-$1^5$B (6 tasks: 5-9, 0, 1, 2, 3, 4), D7-$1^3$A (4 tasks: 0-6, 7, 8, 9), D7-$1^3$B (4 tasks: 3-9, 0, 1, 2), D20-$1^5$A (6 tasks: 0-19, 20, 21, 22, 23, 24, EMNIST only) and D2-$2^5$ (5 tasks: 0-1, 2-3, 4-5, 6-7, 8-9).
+ %
+ %\begin{table}[h!]
+% \scriptsize
+% \renewcommand{\arraystretch}{.9}
+% \centering
+% \begin{tabular}{ c:c: c | c | c | c | c | c }
+% & task split & {$T_1$} & {$T_2$} & {$T_3$} & {$T_4$} & {$T_5$} & {$T_6$} \\[0.2ex]
+% \cdashline{1-8}
+% & & & & & & \\
+% \multirow[c]{3}{*}[0in]{\rotatebox{90}{CIL-P.}} & \multicolumn{1}{:c:}{D5-$1^5$A} & [0-4] & 5 & 6 & 7 & 8 & 9 \\
+% %\hline
+% % D6-$1^4$A & [0-5] & 6 & 7 & 8 & 9 & / \\
+% % D6-$1^4$B & [4-9] & 0 & 1 & 2 & 3 & / \\
+% % \hline
+% & \multicolumn{1}{:c:}{D7-$1^3$A} & [0-6] & 7 & 8 & 9 & / & / \\
+% %\hline
+% & \multicolumn{1}{:c:}{D20-$1^5$A} & [0-19] & 20 & 21 & 22 & 23 & 24 \\
+% \end{tabular}
+% \hspace{0.25cm}
+% \begin{tabular}{ c: c | c | c | c | c | c }
+% task split & {$T_1$} & {$T_2$} & {$T_3$} & {$T_4$} & {$T_5$} & {$T_6$} \\[0.2ex]
+% \cdashline{1-7}
+% & & & & & & \\
+% D5-$1^5$B & [5-9] & 0 & 1 & 2 & 3 & 4 \\
+% %\hline
+% D7-$1^3$B & [3-9] & 0 & 1 & 2 & / & / \\
+% %\hline
+% D20-$1^5$B & [5-24] & 0 & 1 & 2 & 3 & 4 \\
+%
+% \end{tabular}
+% \caption{Showcase of the investigated CL/CIL-Problems. Each task $T_i$ contains image and label data $(X,Y)$ from each of the corresponding classes.
+% D20-$1^5$A and D20-$1^5$B are exclusive for E-MNIST.
 %Initial task $T_1$ data is balanced w.r.t classes.
- \label{tab:slts}
- }
- \end{table}
 For AR, selective replay of $D_i$ samples is performed before training on task $T_{i}, i>1$ using the current scholar $S_{i-1}$, where $D_i$ represents the amount of training samples contained in $T_i$. For DGR, replay of $D_i$ samples is likewise performed before training on task $T_i$. This replay strategy keeps the amount of generated samples \textit{constant w.r.t the number of tasks}, and thus comes with modest temporary storage requirements instead of growing linearly with an increasing amount of incoming tasks.
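The component-activation criterion in the \subsection{Approach: AR} hunk above can be checked numerically: count how many of the $K$ mixture components win the argmax for samples of the new task. The sketch below is illustrative only; it uses scikit-learn's GaussianMixture for brevity, whereas the paper trains its GMMs by SGD following gepperth2021gradient, and the data and variable names are assumptions of mine, not the authors' code.

# Minimal sketch (not the paper's implementation): how many GMM components are
# "activated" by new-task data, i.e. |{argmax_k pi_k N(x_i; mu_k, Sigma_k)}| vs. K.
import numpy as np
from sklearn.mixture import GaussianMixture

K = 25                                    # number of mixture components
rng = np.random.default_rng(0)

# stand-ins for data of past tasks p^(1..t-1) and of the new task p^(t)
past_data = rng.normal(0.0, 1.0, size=(5000, 32))
new_data = rng.normal(3.0, 0.1, size=(500, 32))    # tight cluster, should hit few components

gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0)
gmm.fit(past_data)

# predict() returns the argmax of the posterior responsibilities, which is the
# same argmax_k pi_k N(x_i; mu_k, Sigma_k) used in the activation criterion
winners = gmm.predict(new_data)
activated = np.unique(winners)

print(f"activated components: {activated.size} of K={K}")
# AR's adiabatic assumption holds when this count is much smaller than K;
# only these components would then be queried for replay and selectively updated.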
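The CIL problems listed in the rewritten Experimental Setting paragraph (e.g. D5-$1^5$A: an initial task with classes 0-4, then one new class per task) amount to a class-wise split of a labelled dataset. A small sketch under that reading follows; the helper name and the toy data are illustrative assumptions, not code from the paper.

# Illustrative only: build a class-incremental task sequence such as D5-1^5A
# (initial task with classes 0-4, then one new class per task: 5, 6, 7, 8, 9).
import numpy as np

def make_cil_tasks(x, y, class_groups):
    """Return one (x, y) subset per task, given a list of class groups."""
    tasks = []
    for group in class_groups:
        mask = np.isin(y, group)
        tasks.append((x[mask], y[mask]))
    return tasks

# D5-1^5A on a ten-class dataset (MNIST-like labels assumed)
d5_15a = [list(range(5)), [5], [6], [7], [8], [9]]

# toy stand-in data
y = np.repeat(np.arange(10), 100)
x = np.random.randn(y.size, 28 * 28)

tasks = make_cil_tasks(x, y, d5_15a)
print([np.unique(t_y).tolist() for _, t_y in tasks])
# -> [[0, 1, 2, 3, 4], [5], [6], [7], [8], [9]]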
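The closing paragraph of the last hunk fixes the replay budget: before each task $T_i$ ($i>1$), exactly $D_i$ generated samples (as many as $T_i$ contains) are requested from the previous scholar $S_{i-1}$, so the amount of generated data stays constant as tasks accumulate. The loop below sketches that schedule under a hypothetical scholar interface; fit() and sample() are placeholder methods, not the authors' API, and AR's selective replay is represented only by passing the new data as a query, which plain DGR would omit.

# Sketch of the constant-budget replay schedule (hypothetical scholar API).
import numpy as np

def train_on_sequence(scholar, tasks):
    """tasks: list of (x, y) arrays for T_1, T_2, ...; scholar exposes fit()/sample()."""
    x0, y0 = tasks[0]
    scholar.fit(x0, y0)                                   # initial run on T_1
    for x_new, y_new in tasks[1:]:
        d_i = len(x_new)                                  # replay budget = |T_i|
        x_gen, y_gen = scholar.sample(d_i, query=x_new)   # AR: query with new data;
                                                          # DGR: sample unconditionally
        x_mix = np.concatenate([x_new, x_gen])            # new plus replayed data
        y_mix = np.concatenate([y_new, y_gen])
        scholar.fit(x_mix, y_mix)                         # replay run on T_i
    return scholar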