
1 SEIT, University of New South Wales, Canberra, Australia
  Email: t.dam@student.adfa.edu.au, {s.anavatti,h.abbas}@adfa.edu.au
2 STEM, University of South Australia, Adelaide, Australia
  Email: dhika.pratama@unisa.edu.au
3 ATMRI, Nanyang Technological University, Singapore
  Email: mdmeftahul.ferdaus@ntu.edu.sg

Scalable Adversarial Online Continual Learning: Supplementary Materials

Tanmoy Dam¹ (equal contribution)    Mahardhika Pratama²✉    MD Meftahul Ferdaus³    Sreenatha Anavatti¹    Hussein Abbas¹

This supplementary document comprises the following two sections.

  • Section 1: Task specifications and properties of our proposed SCALE and the baselines with respect to the online continual learning setting.

  • Section 2: Hyperparameter configurations for all the methods.

1 Task Specifications

In this work, task orders are neither randomized nor optimized. For SCIFAR-10, SCIFAR-100, and SMINIIMAGENET, the same task order as in the original datasets is maintained, whereas the order is random (by default) for pMNIST. The main features of our proposed SCALE and the baselines with respect to the online continual learning setting are presented in Table 1.

Table 1: Properties of our proposed SCALE and the baselines with respect to the online continual learning setting

Method   Episodic   Task-Specific   Require Task ID     Store Historical Params/    Store Historical
         Memory     Parameters      During Inference    Grads/Logits (training)     Params (testing)
GEM      ✓          ×               ✓                   ✓                           ×
MER      ✓          ×               ✓                   ✓                           ×
MIR      ✓          ×               ✓                   ×                           ×
ER       ✓          ×               ✓                   ✓                           ×
CTN      ✓          ✓               ✓                   ✓                           ×
SCALE    ✓          ✓               ✓                   ✓                           ×
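
To make the task-ordering convention described at the start of this section concrete, the sketch below (ours, not the authors' code) shows one way the task streams could be assembled: the split benchmarks keep the original class order of the datasets, while pMNIST draws a fresh random pixel permutation per task. The function names, array shapes, and the two-classes-per-task value are illustrative assumptions.

```python
import numpy as np

def make_pmnist_tasks(x, y, num_tasks=10, seed=0):
    """Permuted MNIST: each task applies a fresh random pixel permutation to x (N, 784)."""
    rng = np.random.default_rng(seed)
    tasks = []
    for _ in range(num_tasks):
        perm = rng.permutation(x.shape[1])   # new random permutation defines a new task
        tasks.append((x[:, perm], y))        # labels are unchanged
    return tasks

def make_split_tasks(y, classes_per_task=2):
    """Split benchmarks (SCIFAR-10/100, SMINIIMAGENET): keep the original class order."""
    classes = np.unique(y)                   # classes in their original (sorted) order
    return [classes[i:i + classes_per_task]
            for i in range(0, len(classes), classes_per_task)]
```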

2 Hyperparameter Specifications

The hyperparameter settings of the consolidated methods are presented as follows:

  1. GEM

     1.1. Learning rate: 0.03 (PMNIST, SCIFAR-100, and SCIFAR-10) and 0.05 (SMINIIMAGENET)
     1.2. Number of gradient updates: 1 (all benchmarks)
     1.3. Margin for QP: 0.5 (all benchmarks)

  2. MER

     2.1. Learning rate: 0.03 (PMNIST) and 0.05 (SCIFAR-100, SCIFAR-10, and SMINIIMAGENET)
     2.2. Replay batch size: 64 (all benchmarks)
     2.3. Reptile rate β: 0.3 (all benchmarks)
     2.4. Number of gradient updates: 3

  3. MIR

     3.1. Learning rate: 0.03 (all benchmarks)
     3.2. Replay batch size: 10 (all benchmarks)
     3.3. Number of gradient updates: 3

  4. ER

     4.1. Learning rate: 0.03 (all benchmarks)
     4.2. Replay batch size: 10 (all benchmarks)
     4.3. Number of gradient updates: 3

  5. CTN

     5.1. Inner learning rate: 0.03 (PMNIST) and 0.01 (SCIFAR-100, SCIFAR-10, and SMINIIMAGENET)
     5.2. Outer learning rate: 0.1 (PMNIST) and 0.05 (SCIFAR-100, SCIFAR-10, and SMINIIMAGENET)
     5.3. Number of inner & outer updates: 2 (all benchmarks)
     5.4. Temperature and weight for KL: 5 and 100 (all benchmarks)
     5.5. Replay batch size: 64 (all benchmarks)
     5.6. Semantic memory percentage: 20%

  6. SCALE (collected into a configuration sketch after this list)

     6.1. Inner learning rate: 0.1 (PMNIST) and 0.01 (SCIFAR-100, SCIFAR-10, and SMINIIMAGENET)
     6.2. Outer learning rate: 0.01 (PMNIST) and 0.1 (SCIFAR-100, SCIFAR-10, and SMINIIMAGENET)
     6.3. Adversarial learning rate: 0.001 (all benchmarks)
     6.4. Number of inner & outer updates: 1 (all benchmarks)
     6.5. Number of discriminator updates: 1 (all benchmarks)
     6.6. Weights: λ₁ = λ₂ = 1, λ₃ = 0.03 (all benchmarks)
     6.7. Replay batch size: 64 (all benchmarks)
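
For reference, the SCALE settings in item 6 can be gathered into a single per-benchmark configuration. The sketch below is ours; the key and function names are assumptions for illustration rather than the authors' code.

```python
# Assumed layout (key names are ours): SCALE hyperparameters from item 6,
# with per-benchmark values stored as nested dictionaries.
SCALE_CONFIG = {
    "inner_lr": {"PMNIST": 0.1, "SCIFAR-100": 0.01, "SCIFAR-10": 0.01, "SMINIIMAGENET": 0.01},
    "outer_lr": {"PMNIST": 0.01, "SCIFAR-100": 0.1, "SCIFAR-10": 0.1, "SMINIIMAGENET": 0.1},
    "adversarial_lr": 0.001,        # same value on all benchmarks
    "num_inner_outer_updates": 1,   # number of inner & outer updates
    "num_discriminator_updates": 1,
    "lambda1": 1.0,
    "lambda2": 1.0,
    "lambda3": 0.03,
    "replay_batch_size": 64,
}

def resolve_config(benchmark):
    """Flatten SCALE_CONFIG for one benchmark, e.g. resolve_config('PMNIST')."""
    return {key: (val[benchmark] if isinstance(val, dict) else val)
            for key, val in SCALE_CONFIG.items()}
```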

For cross-validation over the three validation tasks, which are not seen during continual learning, a grid search is performed so that a single hyperparameter setting is kept consistent across all three tasks. The search grid is listed below and illustrated in a short sketch after the list:

  • λ₁, λ₂ ∈ {1, 3}

  • λ₃ ∈ {0.03, 0.09, 0.3, 0.9}

  • α ∈ {0.001, 0.01, 0.1}

  • β ∈ {0.001, 0.01, 0.1}
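
The grid above contains 2 × 2 × 4 × 3 × 3 = 144 combinations. The sketch below (ours, not the authors' code) shows one way such a search could be run; evaluate_on_validation_tasks is a hypothetical callback that trains with one setting on the three validation tasks and returns a validation score.

```python
# Minimal grid-search sketch. The grid values mirror the list above; the
# evaluation callback is a hypothetical placeholder, not part of the paper.
from itertools import product

GRID = {
    "lambda1": [1, 3],
    "lambda2": [1, 3],
    "lambda3": [0.03, 0.09, 0.3, 0.9],
    "alpha":   [0.001, 0.01, 0.1],
    "beta":    [0.001, 0.01, 0.1],
}

def grid_search(evaluate_on_validation_tasks):
    """Return the best hyperparameter setting found over all grid combinations."""
    best_score, best_setting = float("-inf"), None
    keys = list(GRID)
    for values in product(*(GRID[k] for k in keys)):
        setting = dict(zip(keys, values))    # one fixed setting for all three tasks
        score = evaluate_on_validation_tasks(setting)
        if score > best_score:
            best_score, best_setting = score, setting
    return best_setting, best_score
```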