Active Learning

Active learning is a machine learning technique that selects the most informative data points for training a model, here an emulator. The aim is to reduce the amount of training data required for the model to reach a given level of accuracy. This is achieved by iteratively choosing the new data point that is expected to be the most informative.
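
A minimal, self-contained sketch of this loop is shown below. It is not part of psimpy: the toy simulator, the candidate grid, and the distance-to-nearest-sample criterion are illustrative stand-ins for the model-based uncertainty measures (entropy or variance of a Gaussian process emulator) used by the ActiveLearning class.

    import numpy as np

    def expensive_model(x):
        # Stand-in for a computationally expensive simulator.
        return np.sin(3.0 * x) + 0.5 * x

    rng = np.random.default_rng(0)
    candidates = np.linspace(0.0, 2.0, 201)        # dense grid of candidate inputs
    samples = list(rng.uniform(0.0, 2.0, size=3))  # small initial design
    outputs = [expensive_model(x) for x in samples]

    for _ in range(5):
        # Pick the candidate that is currently "most uncertain"; here the
        # distance to the nearest evaluated point is a crude surrogate for a
        # model-based measure such as predictive variance.
        dists = np.min(np.abs(candidates[:, None] - np.asarray(samples)[None, :]), axis=1)
        x_new = candidates[np.argmax(dists)]
        samples.append(x_new)
        outputs.append(expensive_model(x_new))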

In this module, the ActiveLearning class is implemented to actively build a Gaussian process emulator for the natural logarithm of the unnormalized posterior in Bayesian inference. It is intended to facilitate efficient parameter calibration of computationally expensive simulators. For theoretical details, please refer to Wang and Li [2018], Kandasamy et al. [2017], and Zhao and Kowalski [2022].

ActiveLearning Class

The ActiveLearning class is imported by:

from psimpy.inference.active_learning import ActiveLearning

Methods

class ActiveLearning(ndim, bounds, data, run_sim_obj, prior, likelihood, lhs_sampler, scalar_gasp, scalar_gasp_trend='constant', indicator='entropy', optimizer=scipy.optimize.brute, args_prior=None, kwgs_prior=None, args_likelihood=None, kwgs_likelihood=None, args_optimizer=None, kwgs_optimizer=None)[source]

Construct a scalar GP emulator for the natural logarithm of the product of prior and likelihood (i.e. the unnormalized posterior), via active learning.

Parameters
  • ndim (int) – Dimension of parameter x.

  • bounds (numpy array) – Upper and lower boundaries of each parameter. Shape (ndim, 2). bounds[:, 0] corresponds to the lower boundaries and bounds[:, 1] to the upper boundaries of the parameters.

  • data (numpy array) – Observed data for parameter calibration.

  • run_sim_obj (instance of class RunSimulator) – It has an attribute simulator and two methods to run the simulator, namely serial_run() and parallel_run(). For each simulation, simulator must return outputs y as a numpy array.

  • prior (Callable) – Prior probability density function. Call with prior(x, *args_prior, **kwgs_prior) and return the value of prior probability density at x.

  • likelihood (Callable) – Likelihood function constructed based on data and simulation outputs y evaluated at x. Call with likelihood(y, data, *args_likelihood, **kwgs_likelihood) and return the likelihood value at x.

  • lhs_sampler (instance of class LHS) – Latin hypercube sampler used to draw initial samples of x. These initial samples are used to run initial simulations and build initial emulator.

  • scalar_gasp (instance of class ScalarGaSP) – An object which sets up the emulator structure. Providing training data, the emulator can be trained and used to make predictions.

  • scalar_gasp_trend (str or Callable, optional) – Mean function of the scalar_gasp emulator, used to determine the trend (or testing_trend) at given design (or testing_input). ‘zero’ – trend is set to zero. ‘constant’ – trend is set to a constant. ‘linear’ – trend is linear in design or testing_input. Callable – a function that takes design or testing_input as its parameter and returns the trend. Default is ‘constant’.

  • indicator (str, optional) – Indicator of uncertainty. ‘entropy’ or ‘variance’. Default is ‘entropy’.

  • optimizer (Callable, optional) – A function which finds the input point x that minimizes the uncertainty indicator at each iteration step. Called with optimizer(func, *args_optimizer, **kwgs_optimizer). The objective function func is defined by the class method _uncertainty_indicator(), which has only one argument x. The optimizer should return either the solution array, or a scipy.optimize.OptimizeResult object which has the attribute x denoting the solution array. Default is scipy.optimize.brute().

  • args_prior (list, optional) – Positional arguments for prior.

  • kwgs_prior (dict, optional) – Keyword arguments for prior.

  • args_likelihood (list, optional) – Positional arguments for likelihood.

  • kwgs_likelihood (dict, optional) – Keyword arguments for likelihood.

  • args_optimizer (list, optional) – Positional arguments for optimizer.

  • kwgs_optimizer (dict, optional) – Keyword arguments for optimizer.
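
The sketch below illustrates how these constructor arguments fit together for a toy two-dimensional calibration problem. The simulator, prior, and likelihood are purely illustrative, and the constructor arguments passed to RunSimulator, LHS, and ScalarGaSP (such as var_inp_parameter and seed) are assumptions here; consult the documentation of those classes for their exact signatures.

    import numpy as np
    from psimpy.simulator import RunSimulator
    from psimpy.sampler import LHS
    from psimpy.emulator import ScalarGaSP
    from psimpy.inference.active_learning import ActiveLearning

    ndim = 2
    bounds = np.array([[0.0, 1.0], [0.0, 1.0]])  # (ndim, 2): lower and upper bounds
    data = np.array([1.2])                       # observed data

    def simulator(x1, x2):
        # Toy simulator; must return outputs y as a numpy array.
        return np.array([x1 + x2])

    def prior(x):
        # Uniform prior density over the unit square.
        return 1.0

    def likelihood(y, data):
        # Gaussian likelihood of the observed data given simulation outputs y.
        return float(np.exp(-0.5 * np.sum((y - data) ** 2) / 0.1 ** 2))

    run_simulator = RunSimulator(simulator, var_inp_parameter=['x1', 'x2'])
    lhs_sampler = LHS(ndim=ndim, bounds=bounds, seed=42)
    scalar_gasp = ScalarGaSP(ndim=ndim)

    active_learner = ActiveLearning(
        ndim, bounds, data, run_simulator, prior, likelihood,
        lhs_sampler, scalar_gasp)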

initial_simulation(n0, prefixes=None, mode='serial', max_workers=None)[source]

Run n0 initial simulations.

Parameters
  • n0 (int) – Number of initial simulation runs.

  • prefixes (list of str, optional) – Consists of n0 strings. Each is used to name the corresponding simulation output file(s). If None, ‘sim0’, ‘sim1’, … are used.

  • mode (str, optional) – ‘parallel’ or ‘serial’. Run the n0 simulations in parallel or in serial. Default is ‘serial’.

  • max_workers (int, optional) – Controls the maximum number of tasks running in parallel. Default is the number of CPUs on the host.

Return type

tuple[ndarray, ndarray]

Returns

  • init_var_samples (numpy array) – Variable input samples for n0 initial simulations. Shape of (n0, ndim).

  • init_sim_outputs (numpy array) – Outputs of n0 initial simulations. init_sim_outputs.shape[0] is n0.
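
Continuing the hedged constructor sketch above, the initial stage draws n0 samples with the LHS sampler and runs the simulator at each of them:

    init_var_samples, init_sim_outputs = active_learner.initial_simulation(
        n0=20, mode='serial')

    print(init_var_samples.shape)     # (20, ndim)
    print(init_sim_outputs.shape[0])  # 20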

iterative_emulation(ninit, init_var_samples, init_sim_outputs, niter, iter_prefixes=None)[source]

Sequentially pick niter new input points, starting from ninit initial simulations, and update the emulator at each iteration.

Parameters
  • ninit (int) – Number of initial simulations.

  • init_var_samples (numpy array) – Variable input samples for ninit simulations. Shape of (ninit, ndim).

  • init_sim_outputs (numpy array) – Outputs of ninit simulations. init_sim_outputs.shape[0] is ninit.

  • niter (int) – Number of iterative simulations.

  • iter_prefixes (list of str, optional) – Consists of niter strings. Each is used to name the corresponding iterative simulation output file(s). If None, ‘iter_sim0’, ‘iter_sim1’, … are used.

Return type

tuple[ndarray, ndarray, ndarray]

Returns

  • var_samples (numpy array) – Variable input samples of ninit simulations and niter iterative simulations. Shape of (ninit+niter, ndim).

  • sim_outputs (numpy array) – Outputs of ninit and niter simulations. sim_outputs.shape[0] is ninit+niter.

  • ln_pxl_values (numpy array) – Natural logarithm values of the product of prior and likelihood at ninit and niter simulations. Shape of (ninit+niter,).

Notes

If a duplicated iteration point is returned by the optimizer, the iteration is stopped immediately. In that case, the first dimension of the returned var_samples, sim_outputs, and ln_pxl_values is smaller than ninit+niter.
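
Continuing the same sketch, the iterative stage consumes the initial design and picks niter new points one at a time:

    var_samples, sim_outputs, ln_pxl_values = active_learner.iterative_emulation(
        ninit=20,
        init_var_samples=init_var_samples,
        init_sim_outputs=init_sim_outputs,
        niter=30)

    # Unless the optimizer returned a duplicated point, there are now
    # ninit + niter = 50 evaluated samples.
    print(var_samples.shape)    # (50, ndim)
    print(ln_pxl_values.shape)  # (50,)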

approx_ln_pxl(x)[source]

Approximate the ln_pxl value at x based on the trained emulator.

Parameters

x (numpy array) – One variable sample at which ln_pxl is to be approximated. Shape of (ndim,).

Return type

float

Returns

The emulator-predicted ln_pxl value at x.
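
Once the emulator has been trained by iterative_emulation, approx_ln_pxl provides a cheap approximation of the unnormalized log-posterior, for example as the target density in a subsequent sampling or grid-based inference step. Continuing the sketch above:

    x = np.array([0.4, 0.8])
    ln_pxl = active_learner.approx_ln_pxl(x)
    print(ln_pxl)  # emulator-predicted ln of prior times likelihood at x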