diff --git a/iclr2024_conference.tex b/iclr2024_conference.tex
index 56b68dbee904ff6b119dcefe554fbdd2fff9a2c3..8ccb92b05267894de1910421695bae40aaf353a3 100644
--- a/iclr2024_conference.tex
+++ b/iclr2024_conference.tex
@@ -571,45 +571,47 @@
 \section{Feature Encoding by foundation models}
 \label{app:fm}
 Encoding features using foundation models, i.e., frozen, pre-trained networks that transform raw data into a higher-level, invariant representation to operate on, has been shown to be beneficial for CL \cite{van2020brain,hayes2020remind,pellegrini2020latent}. A currently promising direction for pre-training such models is \textit{contrastive learning}, which is performed in a supervised \cite{khosla2020supervised} (SupCon) or self-supervised fashion \cite{caron2020unsupervised,chen2020simple,dwibedi2021little} (SSCL). In this study, we rely on SupCon to build a robust feature extractor for the more complex datasets (SVHN, CIFAR).
- \\ \\
+
 Here, we take a portion of the data from the target domain for pre-training, but exclude these instances from further usage in downstream CL tasks. For SVHN, we draw a number of samples equal to half of the training set from the \enquote{extra} split. For CIFAR10, we split the training set in half and use one part for pre-training and the other for encoding and later usage in downstream CL. The data used to pre-train the foundation model are thus similar but not identical to subsequent training data, following the approach of \cite{van2020brain}.
- \\ \\
+
 An additional data augmentation module normalizes the input and performs random horizontal flipping as well as random rotation within $\pm 2\% \cdot 2\pi$ for each input image. The encoder backbone is a ResNet-50 with randomly initialized weights, trained for $256$ epochs using a batch size of $\beta=256$. No further fine-tuning is performed after pre-training. We use the normalized activations of the final pooling layer ($D = 2048$) as the representation vector.
- %
+
 For supervised training, a projection head is attached, consisting of two hidden layers with $2048$ and $128$ projection units, followed by ReLU activation. The multi-class N-pair loss \cite{sohn2016improved} uses a temperature of $0.05$ and is optimized via ADAM with a learning rate of $\epsilon=0.001$, $\beta_1=0.9$ and $\beta_2=0.999$.
- %
+
 After pre-training, we push the complete training data through the encoder network and save the output to disk for later usage. However, it would be equally legitimate to use the model on-the-fly to encode the data mini-batch-wise, though this comes at the cost of reduced runtime efficiency.
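+
+ As an illustration of this offline encoding step, the following is a minimal sketch assuming a TensorFlow/Keras setup; \texttt{resnet50\_supcon} and the file name are hypothetical placeholders, not our actual code:
+ \begin{verbatim}
+ import numpy as np
+ import tensorflow as tf
+
+ def encode_dataset(backbone: tf.keras.Model,
+                    images: np.ndarray,
+                    batch_size: int = 256) -> np.ndarray:
+     backbone.trainable = False  # frozen foundation model
+     feats = backbone.predict(images, batch_size=batch_size)
+     # L2-normalize the final pooling activations (D = 2048)
+     return feats / np.linalg.norm(feats, axis=1, keepdims=True)
+
+ # features = encode_dataset(resnet50_supcon, x_train)
+ # np.save("train_features.npy", features)  # reused in downstream CL
+ \end{verbatim}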
 %
 \section{AR training}
 \label{app:ar}
 AR employs a GMM scholar $L_{(G)}$ with $K=225$ (MNIST, FMNIST, E-MNIST, Fruits) and $K=400$ (SVHN, CIFAR) components and diagonal covariance matrices. The choice of $K$ follows a \enquote{the more the better} principle and is limited only by available GPU memory.
- %
- GMM generator training follows the procedures and best-practice settings presented and justified in \cite{gepperth2021gradient}. Training is terminated via early stopping when $L_{(G})$ reaches a plateau of stationary loss for the current task $T_i$. We set the training epochs to $512$ as an upper bound.
- Both, $L_{(G})$ and the classification head are independently optimized via vanilla SGD using a fixed learning rate of $\epsilon=0.05$.
- The relative strengths of component weight and covariance matrix adaptation are set to $0.1$. %TODO: 0.1 passt?
- %
- Annealing controls the GMM component adaptation radius for $L_{(G)}$ via parameter $r_{0}$. It is set to $r_{0}^{init}=\sqrt{0.125K}$ for the first (initial) training on $T_1$, and $r_{0}^{replay}=0.1$ for subsequent (replay) tasks $T_{i>1}$.
+
+ GMM generator training follows the procedures and best-practice settings presented and justified in \cite{gepperth2021gradient}. Training is terminated via early stopping when $L_{(G)}$ reaches a plateau of stationary loss for the current task $T_i$. We set an upper bound of $512$ training epochs.
+ Both $L_{(G)}$ and the classification head are independently optimized via vanilla SGD using a fixed learning rate of $\epsilon=0.05$.
+ The relative strengths of component weight and covariance matrix adaptation are set to $0.1$.
+
+ Annealing controls the GMM component adaptation radius for $L_{(G)}$ via the parameter $r_{0}$. It is set to $r_{0}^{init}=\sqrt{0.125K}$ for the first (initial) training on $T_1$, and to $r_{0}^{replay}=0.1$ for subsequent (replay) tasks $T_i$, $i>1$.
 % Sampling
 GMM sampling parameters $S=3$ (top-S) and $\rho=1.0$ (normalization) are kept fixed throughout all experiments.
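+
+ To illustrate the sampling step, the sketch below draws samples from a diagonal-covariance GMM while restricting component selection to the $S$ most probable components (top-$S$). This is a simplified stand-in for the exact AR sampling procedure of \cite{gepperth2021gradient} (e.g., the $\rho$ normalization is omitted), and all names are hypothetical:
+ \begin{verbatim}
+ import numpy as np
+
+ def sample_gmm_top_s(pi, mu, var, n, S=3, seed=0):
+     # pi: (K,) mixture weights; mu, var: (K, D) means and
+     # diagonal variances of the GMM scholar's components
+     rng = np.random.default_rng(seed)
+     top = np.argsort(pi)[-S:]          # indices of the top-S components
+     p = pi[top] / pi[top].sum()        # renormalized selection weights
+     ks = rng.choice(top, size=n, p=p)  # one component per sample
+     return rng.normal(mu[ks], np.sqrt(var[ks]))  # (n, D) samples
+ \end{verbatim}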
- %
+
 Other AR hyperparameters are retained at the values reported in \cite{gepperth2021gradient} for all experiments, since their choice is independent of the particular CL problem at hand.
 % Left out for now: in the experiments for MNIST/Fruits etc., this was used because it empirically helped to form more distinct prototypes, and fewer degenerate components emerged ...
 %One exception is adjusting $\gamma=0.96$ (GMM annealing control) for the initial (from scratch) training on $T_1$ to support the convergence of components, while reducing it back to $\gamma=0.9$ for subsequent replay tasks $T_{i>1}$.
 %Here, we depart from the recommendations of \cite{gepperth2021gradient} since we are dealing with a scenario of successive learning phases.
 % -----------------------------------------------------
- \section{Deep Generative Replay training}
- \label{app:dgr}
- \begin{table}[h]
+ \clearpage
+ \section{Deep Generative Replay training}\label{app:dgr}
+ \begin{table}
 \setlength{\arrayrulewidth}{0.25mm}
 \renewcommand{\arraystretch}{1.25}
 %\setlength{\tabcolsep}{16pt}
 \centering
 \scriptsize
- \begin{tabular}{ lc|lc|lc}
- \textbf{Component} & \textbf{Layer} & \multicolumn{4}{:c}{...} \\
+ \begin{tabular}{lc:lc:lc}
+ \textbf{Component} & \textbf{Layer} & \multicolumn{4}{|c}{...} \\
+ \hline \hline
+ & & & & & \\
 \textbf{Encoder} & C2D(32,5,2)-ReLU & \textbf{Decoder} & Dense(100)-ReLU & \textbf{Solver} & C2D(32,5,1)-ReLU \\
@@ -617,20 +619,23 @@
 & C2D(64,5,2)-ReLU & & Dense((H/4)*(W/4)*64)-ReLU & & MP2D(2) \\
 & Flatten & & Reshape((H/4),(W/4),64)-ReLU & & C2D(64,5,1)-ReLU \\
 & Dense(100)-ReLU & & C2DTr(32,5,2)-ReLU & & MP2D(2) \\
- & Dense(50)-ReLU & & C2DTr(C,5,2)-Sig. & & Flatten \\
- & Dense(25) & & & & Dense(100)-ReLU \\
+ & Dense(25)-ReLU & & C2DTr(C,5,2)-Sig. & & Flatten \\
+ & Dense(50) & & & & Dense(100)-ReLU \\
 & & & & & Dense(10)-Softmax \\
 \multicolumn{6}{c}{...} \\
 \hline
+ \hline
+ %\cdashline{1-6}
+ & & & & & \\
 \textbf{LR-Encoder} & Flatten
- & \textbf{LR-Decoder} & Dense(25)
+ & \textbf{LR-Decoder} & Dense(100)-ReLU
 & \textbf{LR-Solver} & Flatten \\
- & Dense(1024)-ReLU & & Dense(100)-ReLU & & Dense(1024)-ReLU \\
- & Dense(100)-ReLU & & Dense(1024)-ReLU & & Dense(100)-ReLU \\
- & Dense(25) & & Dense(2048)-ReLU & & Dense(10)-Softmax \\
- & Dense(50) & & Reshape(N,H,W,C) & & \\
+ & Dense(1024)-ReLU & & Dense(1024)-ReLU & & Dense(1024)-ReLU \\
+ & Dense(100)-ReLU & & Dense(2048)-ReLU & & Dense(100)-ReLU \\
+ & Dense(25)-ReLU & & Reshape(N,H,W,C) & & Dense(10)-Softmax \\
+ & Dense(50) & & & & \\
 \end{tabular}
- \caption{DNN architectures for VAE-based replay (all components) and ER (solvers). A VAE generator consists of a mirrored encoder and decoder network. Components from the first row are utilized for MNIST, FMNIST, E-MNIST and Fruits, while the second row components are deployed for latent replay (LR) on SVHN and CIFAR.
+ \caption{DNN architectures for VAE-based replay (all components) and ER (solvers only). A VAE generator consists of a mirrored encoder and decoder network. First-row components are used for MNIST, FMNIST, E-MNIST and Fruits-360, while second-row components are deployed for latent replay (LR) on SVHN and CIFAR.
 \label{tab:networkstructure}
 }
 \end{table}
@@ -641,16 +646,16 @@
 % Training iterations, i.e., the number of steps over the constituted mini-batches $\beta$, are calculated dynamically for each task. This affects the balanced mixing strategy, as $D_i$ grows linearly, affecting the training duration negatively.
 %This is due to the fact that $\mathcal{D}_{T_{i>1}}$ is significantly smaller than $\mathcal{D}_{T_{1}}$ in a constant-setting.
 %
- \\ \\
- The VAE latent dimension is 25, the disentangling factor $\beta=1.$, and conditional sampling is turned off for MNIST, FashionMNIST, E-MNIST and Fruits datasets, whereas it is turned on for SVHN and CIFAR to enforce that the generator naturally produces past data in equal proportions. For these datasets, we also operate on latent features and use fully-connected DNNs as encoder and decoder, see \cref{tab:networkstructure}.
- The learning rate for solvers and VAE generators are $\epsilon_S=10^{-4}$, $\epsilon_G=10^{-3}$ using the Adam optimizer with $\beta_{1}=0.9$, $\beta_{2}=0.999$. Generator and solver training is performed for 100 and 50 epochs respectively. We reinitialize both structures before each new task on MNIST, FashionMNIST, E-MNIST and Fruits.
+
+ The VAE latent dimension is 25, the disentangling factor $\beta=1.0$, and conditional sampling is turned off for MNIST, FashionMNIST, E-MNIST and Fruits-360, whereas it is turned on for SVHN and CIFAR to enforce that the generator produces samples of previously seen classes in equal proportions. For these datasets, we also operate on latent features and use fully-connected DNNs as encoder and decoder, see \cref{tab:networkstructure}.
+
+ The learning rates for solvers and VAE generators are $\epsilon_S=10^{-4}$ and $\epsilon_G=10^{-3}$, using the ADAM optimizer with $\beta_{1}=0.9$, $\beta_{2}=0.999$. Generator and solver training is performed for 100 and 50 epochs, respectively. We reinitialize both structures before each new task on MNIST, FashionMNIST, E-MNIST and Fruits-360.
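+
+ The per-task DGR procedure can be summarized by the following schematic sketch; the \texttt{fit}/\texttt{sample}/\texttt{predict} interfaces are assumed placeholders rather than our actual implementation:
+ \begin{verbatim}
+ import numpy as np
+
+ def dgr_task_loop(tasks, make_generator, make_solver):
+     # tasks: iterable of (x, y) arrays, one pair per task T_i
+     generator, solver = None, None
+     for i, (x, y) in enumerate(tasks):
+         if i > 0:  # mix current data with generated replay data
+             x_gen = generator.sample(len(x))  # replay past tasks
+             y_gen = solver.predict(x_gen)     # label via previous solver
+             x = np.concatenate([x, x_gen])
+             y = np.concatenate([y, y_gen])
+         generator = make_generator()  # reinitialized before each task
+         solver = make_solver()        # (MNIST-like datasets)
+         generator.fit(x, epochs=100)  # VAE generator training
+         solver.fit(x, y, epochs=50)   # solver training
+     return generator, solver
+ \end{verbatim}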
 %-------------------------------------------------------------------------
 \section{Experience Replay training}\label{app:er}
- The solvers for ER are taken to be the same as for VAE-DGR, see \cref{tab:networkstructure}. The ADAM optimizer is used with a learning rate of $10^{-4}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, and the network is trained for 50 epochs on each task.
- %
- \\ \\
- Similar to \cite{riemer2018learning}, reservoir sampling is used to select $50$ samples of each encountered class to be stored. For replay, oversampling of the buffer is performed to obtain a number of samples, equal to the amount of data instances present in the current task $T_i$. Thus, we choose an ER implementation that has constant time complexity, although the number of distinct samples per task will decrease over time. At some point, CL will break down because there are too few distinct samples per task to protect previously acquired knowledge.
- %
- Analogous to the procedure for DGR, we use replay on latent feature representations, see e.g., \cite{pellegrini2020latent} encoded by a foundation model as described in \cref{app:fm} for SVHN and CIFAR.
+ The solvers for ER are taken to be the same as for VAE-DGR (see the rightmost column in \cref{tab:networkstructure}). The ADAM optimizer is used with a learning rate of $10^{-4}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, and the network is trained for 50 epochs on each task. Analogous to the procedure for DGR, we perform replay on latent feature representations (see, e.g., \cite{pellegrini2020latent}), encoded by a foundation model as described in \cref{app:fm}, for SVHN and CIFAR.
+
+ Similar to \cite{riemer2018learning}, reservoir sampling is used to select $50$ samples of each encountered class to be stored. For replay, the buffer is oversampled to obtain a number of samples equal to the number of data instances present in the current task $T_i$.
+
+ We thus choose an ER implementation with constant time complexity, although the number of distinct samples per task will decrease over time. At some point, CL will break down because too few distinct samples per task remain to protect previously acquired knowledge.
 % -----------------------------------------------------------------
 \end{document}