diff --git a/cl_bib_ak.bib b/cl_bib_ak.bib index b71a49561e727fa3588ad1fcc0f6662f3db4cc54..43c9e37c68cab5b81dcc97a666a808615e5f6f6f 100755 --- a/cl_bib_ak.bib +++ b/cl_bib_ak.bib @@ -1,3 +1,24 @@ +@article{mcclelland2020integration, + title={Integration of new information in memory: new insights from a complementary learning systems perspective}, + author={McClelland, James L and McNaughton, Bruce L and Lampinen, Andrew K}, + journal={Philosophical Transactions of the Royal Society B}, + volume={375}, + number={1799}, + pages={20190637}, + year={2020}, + publisher={The Royal Society} +} + + +@article{klasson23, +author={Klasson, Marcus and Kjellström, Hedvig and Zhang, Cheng}, +journal={Transactions on Machine Learning Research}, +year={2023}, +volume={9}, +title={Learn the Time to Learn: Replay Scheduling in Continual Learning}, +} + + % SVHN % @article{netzer2011reading, title={Reading digits in natural images with unsupervised feature learning}, diff --git a/iclr2024_conference.tex b/iclr2024_conference.tex index ce5a6be2d02fce51612a0d109b6276ba45837d84..afca03d3f7c7ed9dc3c3f44d693f9cd3e273cd8b 100644 --- a/iclr2024_conference.tex +++ b/iclr2024_conference.tex @@ -84,7 +84,7 @@ \section{Introduction}\label{sec:intro} This contribution is in the context of continual learning (CL), a recent flavor of machine learning that investigates learning from data with non-stationary distributions. A common effect in this context is catastrophic forgetting (CF), an effect where previously acquired knowledge is abruptly lost after a change in data distributions. - In the default scenario for class-incremental CL (see, e.g., \cite{bagus2022beyond,van2022three}), a number of assumptions are made in order to render CL more tractable: first of all, distribution changes are assumed to be abrupt, partitioning the data stream into stationary \textit{tasks}. Then, task onsets are supposed to be known, instead of inferring them from data. And lastly, tasks are assumed to be disjoint, i.e., not containing the same classes in supervised scenarios. Together with this goes the constraint that no, or only a few, samples may be stored. + In class-incremental CL (see, e.g., \cite{bagus2022beyond,van2022three}), a number of assumptions are made: distribution changes are assumed to be abrupt, partitioning the data stream into stationary \textit{tasks}. Then, task onsets are assumed to be known rather than inferred from the data. Lastly, tasks are assumed to be disjoint. Added to this is the constraint that no samples, or only a few, may be stored. % % the default scenario for class-incremental supervised CL. The data stream is assumed to be partitioned into \textit{tasks} $T_i$. Statistics within a task are considered stationary, and task data and labels (targets) are assumed to be disjoint, i.e., from different classes (MNIST classes used here for visualization). Task onsets are assumed to be known as well. \begin{figure}[h] @@ -118,8 +118,10 @@ %TODO: constant-time replay %---------------- \subsection{Approach: AR}\label{sec:approach} - Adiabatic replay (AR) prevents an ever-growing number of replayed samples by applying two main strategies: selective replay and selective updating, see \cref{fig:ar1}. Selective replay means that new data are used to query the generator for similar (potentially conflicting) samples. For achieving selective replay, we rely on Gaussian Mixture Models (GMMs) trained by SGD as introduced in \cite{gepperth2021gradient}.
GMMs have limited modeling capacity, but are sufficient for the proof-of-concept that we require here \footnote{Deep Convolutional GMMs, see \cite{gepperth2021image} are a potential remedy for the capacity problem.}. + Adiabatic replay (AR) prevents an ever-growing number of replayed samples by applying two main strategies: selective replay and selective updating, see \cref{fig:ar1}. Selective replay means that new data are used to query the generator for similar (potentially conflicting) samples. For achieving selective replay, we rely on Gaussian Mixture Models (GMMs) trained by SGD as introduced in \cite{gepperth2021gradient}. GMMs have limited modeling capacity, but are sufficient here since we are working with pre-trained feature extractors. + AR is partly inspired by maximally interfered retrieval (MIR), proposed in \cite{mir}, where a fixed replay budget (either for experience replay or generative replay) is composed of the most conflicted samples, i.e., those that would be unlearned most rapidly when training on a new task. In a similar vein, \cite{mcclelland2020integration} hypothesize that just replaying samples that are similar to new ones could be sufficient to avoid forgetting. Another inspiration comes from \cite{klasson23}, where it is shown that replaying the right data at the right moment is preferable to replaying everything. + Adiabatic replay is most efficient when each task $t$, with data $\vec x \sim p^{(t)}$, adds only a small amount of new knowledge. AR models the joint distribution of past tasks as a mixture model with $K$ components, $p^{(1\dots t-1)}(\vec x) = \sum_k \pi_k \mathcal N(\vec x; \vec \mu_k, \mSigma_k)$. We can thus formalize this assumption as the requirement that only a few components of $p^{(1\dots t-1)}$ are activated by new data: $|\{ \argmax_k \pi_k \mathcal N(\vec x_i;\vec \mu_k,\mSigma_k) : \vec x_i \sim p^{(t)}\}| \ll K$. If this assumption is violated, AR will still work, but more components will need to be updated, requiring more samples. % @@ -148,7 +150,7 @@ \par\noindent\textbf{Deep Generative Replay} Here, (deep) generative models like GANs \cite{goodfellow2014generative} and VAEs \cite{kingma2013auto} are used for memory consolidation by replaying samples from previous tasks, see \cref{fig:genrep} and \cite{shin2017continual}. % - The recent growing interest in GR brought up a variety of architectures, either being VAE-based \cite{kamra2017deep, lavda2018continual, ramapuram2020lifelong, ye2020learning, caselles2021s} or GAN-based \cite{wu2018memory, ostapenko2019learning, wang2021ordisco, atkinson2021pseudo}. + The recent surge of interest in GR has produced a variety of architectures, either VAE-based \cite{kamra2017deep, lavda2018continual, ramapuram2020lifelong, ye2020learning, caselles2021s} or GAN-based \cite{ostapenko2019learning, wang2021ordisco, atkinson2021pseudo}. Notably, the MerGAN model \cite{wu2018memory} uses an LwF-type knowledge distillation technique to prevent forgetting in generators, which is more efficient than pure replay. Furthermore, PASS \cite{zhu2021} uses self-supervised learning by sample augmentation in conjunction with slim class-based prototype storage for improving the performance of replay-based CL. % @@ -156,7 +158,7 @@ \cite{van2020brain,pellegrini2020latent,kong2023condensed}.
Built on this idea are models like REMIND \cite{hayes2020remind}, which extends latent replay with compression, and SIESTA \cite{harun2023siesta}, which improves computational efficiency by alternating wake and sleep phases in which different parts of the architecture are adapted. % \par\noindent\textbf{MIR} - Maximally interfered retrieval (MIR), proposed in \cite{mir} is an approach to class-incremental CL where a fixed replay budget (either for experience replay or generative replay) is focused on samples that undergo the highest loss increase when training on a new task. For DGR, the authors propose a gradient-based algorithm to specifically generate such samples using a VAE generator. Conceptually, this is similar to the concept of selective replay, although a key difference is that our GMM generator/solver is capable of selective updating as well. We will use MIR as one of the baselines for our experiments. + MIR (discussed in \cref{sec:approach}) is conceptually similar to selective replay, although a key difference is that our GMM generator/solver is capable of selective updating as well. We will use MIR as one of the baselines for our experiments. % %------------------------------------------------------------------------- \section{Methods}\label{sec:gmm} % Overview of methods at the beginning. Refs to appendix for literature methods such as ER, DGR and foundation models. Explain AR in detail @@ -588,7 +590,7 @@ Summing up, all currently proposed approaches to CL show linear time complexity, often made worse by linear memory complexity. % keep \par\noindent\textbf{Issues with constant-time replay} Many recently proposed methods operate in a constant-time regime, replaying a fixed number of samples for each new task. To balance a constant number $S$ of samples, either from memory or from a generator, w.r.t. samples from the current task, - the former are given a higher weight in the loss \cite{mir}. There are several issues with this: First of all, for a large number of tasks, each task will be less and less represented in $S$ samples, making eventual forgetting inevitable, while weights for past samples grow higer and higher. Then, giving past samples a higher weight effectively increases the learning rate for these samples, which can break SGD if the weights are too high. Alernatively, the weight for the current samples is \textit{reduced} from its baseline value in some works \cite{vandeven}, ultimately leading to low learning rates and thus long training times. And lastly, the precise weights are generally set post-hoc via cross-validation \cite{mir,mergan}, which is inadmissible for CL because it amounts to knowing all tasks beforehand. AR can use constant-time replay without weighting past samples due to selective updating and selective replay. + the former are given a higher weight in the loss \cite{mir}. There are several issues with this: First of all, for a large number of tasks, each task will be less and less represented in $S$ samples, making eventual forgetting inevitable, while weights for past samples grow higher and higher. Then, giving past samples a higher weight effectively increases the learning rate for these samples, which can break SGD if the weights are too high. Alternatively, the weight for the current samples is \textit{reduced} from its baseline value in some works \cite{van2020brain}, ultimately leading to low learning rates and thus long training times.
And lastly, the precise weights are generally set post-hoc via cross-validation \cite{mir,wu2018memory}, which is inadmissible for CL because it amounts to knowing all tasks beforehand. AR can use constant-time replay without weighting past samples due to selective updating and selective replay. % \par\noindent\textbf{Violation of AR assumptions} The assumption that new tasks only add a small contribution is not a hard requirement, just a prerequisite for sample efficiency. Based on the formalization presented in \cref{sec:approach}, its validity is trivial to verify by examining the component activations of the GMM generator when faced with new data. Although we do not implement such a control strategy here, AR would simply need to replay more samples if the new contributions turn out to be large. However, the chances of this happening in practice are virtually zero if the body of existing knowledge is sufficiently large.
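+ To make the selective replay query of \cref{sec:approach} concrete, the following minimal NumPy sketch (illustrative only, not the implementation used in this work) shows how new data select the GMM components they activate and how replay samples are drawn from exactly those components. It assumes diagonal covariances and hypothetical parameter arrays \texttt{pi}, \texttt{mu} and \texttt{var}:
+ \begin{verbatim}
+ import numpy as np
+
+ def log_responsibilities(X, pi, mu, var):
+     # X: (N, D) queries; pi: (K,) weights; mu, var: (K, D) diagonal Gaussians
+     diff = X[:, None, :] - mu[None, :, :]                      # (N, K, D)
+     log_pdf = -0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=-1)
+     return np.log(pi)[None, :] + log_pdf                       # (N, K)
+
+ def selective_replay(X_new, pi, mu, var, per_component=100, seed=0):
+     # Components activated by the new data: best-matching component per query sample.
+     hits = np.unique(np.argmax(log_responsibilities(X_new, pi, mu, var), axis=1))
+     rng = np.random.default_rng(seed)
+     # Draw replay samples only from the activated components.
+     replay = [rng.normal(mu[k], np.sqrt(var[k]), size=(per_component, mu.shape[1]))
+               for k in hits]
+     return np.concatenate(replay, axis=0), hits
+ \end{verbatim}
+ The adiabatic assumption formalized in \cref{sec:approach} corresponds to \texttt{len(hits)} being much smaller than $K$.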
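+ Similarly, the control strategy mentioned in the preceding paragraph (checking the assumption and, if necessary, enlarging the replay budget) could look as follows; this is again only a sketch under the same assumptions, and \texttt{max\_fraction} is an arbitrary threshold for flagging a violation:
+ \begin{verbatim}
+ def replay_budget(X_new, pi, mu, var, per_component=100, max_fraction=0.1):
+     # How many of the K components does the new task activate?
+     hits = np.unique(np.argmax(log_responsibilities(X_new, pi, mu, var), axis=1))
+     violated = len(hits) > max_fraction * len(pi)
+     # Scale the total number of replayed samples with the number of components
+     # that have to be protected by replay.
+     return per_component * len(hits), violated
+ \end{verbatim}
+ Here \texttt{log\_responsibilities} is the helper defined in the previous listing.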