%% Cell type:markdown id: tags:
# Script to optimize the simulation parameters for a water quality model in MIKE+
The nevergrad package is used for optimization.
Note:
- mikepluspy requires a full MIKE+ license to run. In many cases the license seat is not released properly; closing Python and waiting some time should fix this.
- With multithreading, results may take some time to show up in the output, even after the cell appears to have completed.
Other scripts to run:
- First run optim_prep if wanted.
- That script creates the reference data and also the optim_parts Excel file, which contains the timeframes for the optimization (a minimal example of the expected layout is sketched in the cell below this list).
Things to do in MIKE+:
Water quality models:
- activate Transport in "Model Type"
- set at least one WQ component
- set the SWQ advanced methods
- double-check the method under Build-up/wash-off
- check the WQ boundary conditions and that the WQ component is the one set above
- activate Transport in the simulation setup
Sediment transport models:
- activate Sediment Transport in "Model Type"
- set all relevant parameters under Sediment Transport:
  - General Parameters: density and porosity
  - Sediment fractions: insert a new fraction and use a cohesive sediment fraction, as non-cohesive fractions are not fully supported yet
- change the WQ boundary conditions to type "Sediment concentration" and enter the name of the sediment transport component defined above
- under SWQ advanced methods: change the surface load type to "Sediment" and choose the method you want to use
- check that the ST result files save the concentration
- run all models with the ST results to get the test results for the script below
- check the optim_parts Excel file and make sure the result files are the ST files and that the column name is changed to concentration
- activate Sediment Transport in the simulation setup
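Below is a purely illustrative sketch of the layout these two inputs are expected to have. All values are made up, and only the columns referenced directly in this notebook are shown; the SimulationRunner may require further columns, e.g. the result column name mentioned above.
%% Cell type:code id: tags:
``` python3
# Hypothetical example input, not project data.
import pandas as pd
# optim_parts.xlsx / result_spec: one row per event to simulate.
result_spec_example = pd.DataFrame(
    {
        "Label": ["event_1"],
        "SimulationName": ["WQ_base"],
        "ResultReach": ["reach_xy"],
        "ResultNode": [""],
        "Start": [pd.Timestamp("2021-06-01 12:00")],
        "End": [pd.Timestamp("2021-06-01 18:00")],
    }
)
# reference.csv: columns "Label", "Datetime" and one data column (its name does not matter).
reference_example = pd.DataFrame(
    {
        "Label": ["event_1", "event_1"],
        "Datetime": ["2021-06-01 12:00", "2021-06-01 12:05"],
        "concentration": [0.0, 12.5],
    }
)
```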
%% Cell type:code id: tags:
``` python3
# import packages
import typing
import datetime
import shutil
from time import sleep
from warnings import warn
from pathlib import Path
import mikeio
import numpy as np
import pandas as pd
import nevergrad as ng
import matplotlib as mpl
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import IPython.display as idsp
# import simulation scripts from the project ENTfrachtEN
from entfrachten.optim import simulation, loss_functions, optim_dumper
```
%% Cell type:markdown id: tags:
## Constants you should not need to adjust unless you are debugging
%% Cell type:code id: tags:
``` python3
# output heaps of additional data, may be useful for debugging
VERBOSE_OUTPUT = False
# delete the temp folder at end
DELETE_TEMP = True
```
%% Cell type:markdown id: tags:
## Constants you should adjust
%% Cell type:code id: tags:
``` python3
# Number of simulations allowed for optimization
optimization_budget = 100
# Number of seats for the engine; limits how many simulations we can run at once. The default for a MIKE+ license is 4.
engine_seats = 4
# Correction factor by which the measurement data are multiplied, if necessary. Default 1. ENTfrachtEN wastewater = 3.3757
factor_underestimation = 1
# time string for temp and output folder
# if temp_folder exists, this script tries to continue the optimization.
# Note that continuing the optimization does not check whether settings match;
# do not change anything when continuing an optimization run!
time_path_part = datetime.datetime.now().strftime("%Y-%m-%dT%H.%M.%S")
model_folder = Path("example_folder/")
# path of .mupp file, relative to model_folder
db_path_rel = "example.mupp"
# path of the table which specifies which parts of the simulation result should be read, relative to model_folder.
# from optimization_prep.ipynb
result_spec_path = "optim_parts.xlsx"
# path of reference file, relative to model_folder
# This is expected to be a table with three columns: "Label", "Datetime" and a data column whose name does not matter.
# Labels have to be consistent with result_spec.
reference_path = "reference.csv"
# path of parameter configuration table, relative to model_folder.
# Specifies which parameters are optimized and their allowed range
param_spec_path = "parameter_config.xlsx"
# choose optimizers
# for all options:
# print(sorted(ng.optimizers.registry.keys()))
optimizers = ["BayesOptimBO"]
# loss_fn is later used for optimization; use this line to choose a loss function.
loss_fn = loss_functions.mean_negative_bounded_nse
# this value is later used to penalize errors. A simulation result consisting only of NaN values would return exactly this value.
loss_on_error = 1 # alternative:loss_fn(reference, -reference)
display(loss_on_error)
```
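%% Cell type:markdown id: tags:
For orientation, the cell below is only a sketch of what mean_negative_bounded_nse plausibly computes; the authoritative implementation is in entfrachten.optim.loss_functions. The sketch assumes the standard Nash-Sutcliffe efficiency (NSE), bounded as NSE/(2-NSE) so that it lies in (-1, 1], then negated and averaged over the event labels so that smaller is better; an all-NaN result then yields exactly 1, which matches loss_on_error above.
%% Cell type:code id: tags:
``` python3
# Sketch only, under the assumptions described above; not the project implementation.
def sketch_mean_negative_bounded_nse(reference: pd.Series, simulated: pd.Series) -> float:
    """Per-label bounded NSE, negated and averaged; assumes a (Label, Datetime) MultiIndex."""
    bounded_nse_values = []
    for label in reference.index.get_level_values("Label").unique():
        obs = reference.xs(label, level="Label")
        sim = simulated.xs(label, level="Label").reindex_like(obs)
        mask = obs.notna() & sim.notna()
        obs, sim = obs[mask], sim[mask]
        if len(obs) == 0:
            bounded_nse_values.append(-1.0)  # worst case for an event with no usable values
            continue
        nse = 1.0 - ((obs - sim) ** 2).sum() / ((obs - obs.mean()) ** 2).sum()
        bounded_nse_values.append(nse / (2.0 - nse))  # bounded NSE lies in (-1, 1]
    return -float(np.mean(bounded_nse_values))
```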
%% Cell type:markdown id: tags:
## create temp(orary) and out(put) folders for the simulation
%% Cell type:code id: tags:
``` python3
# a folder where used files are saved
# note that this should be located in the same parent folder as the model folder, so that relative paths resolve to the same files
temp_folder = Path("tmp_" + time_path_part)
# output folder
out_folder = Path("out_" + time_path_part)
# create output folder
out_folder.mkdir(exist_ok=True, parents=True)
if not temp_folder.exists():
    # make a copy of the model folder, since the program might mess with the model
    shutil.copytree(model_folder, temp_folder)
    should_continue_optimization = False
    print("starting new optimization.")
else:
    should_continue_optimization = True
    print("continuing previous optimization.")
db_path = temp_folder / db_path_rel
```
%% Cell type:code id: tags:
``` python3
# display the temp folder. If an error occurs, it may be useful to check it.
# The folder path is also needed to resume an interrupted optimization.
temp_folder.absolute()
```
%% Cell type:markdown id: tags:
## read the input data for the simulation
%% Cell type:code id: tags:
``` python3
def file_to_df(path: Path) -> pd.DataFrame:
    """Tries to read the file at path as a pandas DataFrame."""
    if path.suffix == ".xlsx":
        return pd.read_excel(path)
    elif path.suffix == ".dfs0":
        return mikeio.read(filename=path).to_dataframe()
    elif path.suffix == ".csv":
        return pd.read_csv(path)
    else:
        raise NotImplementedError(f"No method for {path.suffix} implemented.")
```
%% Cell type:code id: tags:
``` python3
# read events to be simulated
result_spec = file_to_df(temp_folder / result_spec_path).set_index("Label")
result_spec["ResultReach"] = result_spec["ResultReach"].fillna("").astype(str)
result_spec["ResultNode"] = result_spec["ResultNode"].fillna("").astype(str)
assert (result_spec["Start"] <= result_spec["End"]).all(
axis=None
), "Event end needs to be after start."
result_spec["SimulationName"] = result_spec["SimulationName"].astype(str)
assert result_spec.index.is_unique, "Need unique labels"
result_spec
```
%% Cell type:code id: tags:
``` python3
# read reference data
reference = file_to_df(temp_folder / reference_path)
# parse dates and remove timezone
reference["Datetime"] = pd.to_datetime(reference["Datetime"]).dt.tz_localize(None)
spec_labels = set(result_spec.index.unique())
ref_labels = set(reference["Label"])
if spec_labels != ref_labels and ref_labels.issuperset(spec_labels):
    warn("reference has labels not in spec, dropping superfluous labels.")
    reference = reference.query("Label in @spec_labels")
if spec_labels != ref_labels and ref_labels.issubset(spec_labels):
    raise Exception("Reference is missing labels specified by result_spec.")
reference = reference.set_index(["Label", "Datetime"])
# convert to series and change type to 32-bit float
reference = reference.iloc[:,0].astype("float32")
reference = reference * factor_underestimation
reference
```
%% Cell type:code id: tags:
``` python3
# the reference may contain shifted times; adjust Start/End in the event table accordingly
for label in result_spec.index:
    ref_part = reference.xs(label, level="Label")
    result_spec.loc[label, "Start"] = ref_part.index.min()
    result_spec.loc[label, "End"] = ref_part.index.max()
result_spec[["Start", "End"]]
```
%% Cell type:code id: tags:
``` python3
# specify which parameters are optimized and in which range they should be
# could make the table in python like this
# param_specs = pd.DataFrame([
# {"Table":"msm_HParA","Muid":DEF_STRING,"Column":"RedFactor","Min":.4,"Max":1,"Steps":4},
# {"Table":"msm_HParA","Muid":DEF_STRING,"Column":"InitLoss","Min":0,"Max":5*M_TO_MM,"Steps":4}
# ])
# or load from a file
param_spec = file_to_df(temp_folder / param_spec_path)
# give param_spec a sensible string index; this is later used to pass optimization parameters as a dictionary
param_spec.index = pd.Index(
    param_spec[["Table", "Column", "Muid"]].agg(";".join, axis="columns"),
    name="ParamKey",
)
assert param_spec.index.is_unique, "Index needs to be unique; duplicates indicate duplicated parameter definitions."
param_spec
```
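%% Cell type:markdown id: tags:
As a purely hypothetical illustration (the numeric values are made up), a single row of the parameter configuration table and the ParamKey it produces could look like the cell below; this key format is what simulation_run(**params) and the constraint function further down look up.
%% Cell type:code id: tags:
``` python3
# Hypothetical example row, not project data.
example_row = pd.DataFrame(
    [
        {
            "Table": "msm_ADDispersion",
            "Muid": "Global AD dispersion",
            "Column": "mindispcoef",
            "Default": 1.0,
            "Min": 0.1,
            "Max": 5.0,
        }
    ]
)
# same key construction as above: "Table;Column;Muid"
example_key = example_row[["Table", "Column", "Muid"]].agg(";".join, axis="columns")[0]
# -> "msm_ADDispersion;mindispcoef;Global AD dispersion"
example_key
```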
%% Cell type:code id: tags:
``` python3
db_path
```
%% Cell type:markdown id: tags:
## Create Simulation runs
%% Cell type:code id: tags:
``` python3
assert (
db_path.exists()
), f"DB not found, check path. Expected file in {db_path.absolute()}"
sim_runner = simulation.SimulationRunner(
    db_path=db_path,
    result_folder=temp_folder,
    param_spec=param_spec,
    result_spec=result_spec,
    parallel_limit=engine_seats,
)
# Test dummy
#sim_runner = simulation_dummy.SimulationDummy(reference=reference,param_spec=param_spec,result_spec=result_spec)
```
%% Cell type:markdown id: tags:
## Read test results (optional)
%% Cell type:code id: tags:
``` python3
# Read results which are already in folder.
# This serves as a test for the result specification.
# This is skipped if optimization was started before.
if not should_continue_optimization:
    existing_result = sim_runner.read_results()
    existing_result.index.names = reference.index.names
    print(
        "Plotting results. If this looks very different from the plots in MIKE+, unit conversion may need to be added."
    )
    for label in result_spec.index:
        try:
            txs = existing_result.xs(label, level="Label")
            rxs = reference.xs(label, level="Label")
            missing_ratio = txs.reindex_like(rxs).isna().mean()
            if missing_ratio > 0.2:
                warn(
                    f"Label {label} has many missing values compared to the reference; missing ratio is {missing_ratio}."
                )
            pd.DataFrame({"ref": rxs, "test": txs}).plot(
                title=str(label), figsize=(7, 2.5)
            )
        except Exception as ex:
            warn(str(ex))
```
%% Cell type:markdown id: tags:
## make a test run to see if the optimization works
%% Cell type:code id: tags:
``` python3
# test run
test_res_df = pd.DataFrame({"Reference": reference})
test_res_info = pd.DataFrame()
status_dh = idsp.display("starting", display_id=True)
info_dh = idsp.display("", display_id=True)
fig_dh = idsp.display("figure placeholder", display_id=True)
for key in ["Default", "Min", "Max"]:
    status_dh.update(f"Getting results for {key}")
    csv_path = out_folder / f"test_{key.lower()}.csv"
    if not should_continue_optimization:
        # the ** in the call unrolls the series into keyword parameters
        test_res = sim_runner.simulation_run(**param_spec[key])
        test_res.to_csv(csv_path, index=True)
    else:
        assert (
            csv_path.exists()
        ), "Tried to continue an optimization run, but the csv from the test does not exist. This may mean that the optimization was not started before."
        test_res = pd.read_csv(
            csv_path, index_col=[0, 1], parse_dates=True
        ).iloc[:, 0]
    info = test_res.describe()
    test_res = test_res.reindex_like(reference)
    info["missing ratio"] = test_res.isna().mean()
    info["loss"] = loss_fn(reference, test_res)
    test_res_info[key] = info
    test_res_df[key] = test_res
    info_dh.update(pd.DataFrame(test_res_info))
    status_dh.update(f"Plotting results for {key}")
    plt.ioff()
    label_group = test_res_df.groupby(level="Label")
    fig, axes = plt.subplots(
        nrows=len(label_group),
        figsize=(6, 2.5 * len(label_group)),
        layout="constrained",
        squeeze=False,
    )
    for row, (label, df) in enumerate(label_group):
        df.droplevel("Label").plot(ax=axes[row, 0], title=str(label))
        axes[row, 0].xaxis.label.set_visible(False)
    fig.suptitle("Simulation with different parameters")
    fig.savefig(out_folder / "test_sim.png", bbox_inches="tight")
    fig_dh.update(fig)
    plt.ion()
    # plt.close(fig)
# check results for equality. This test assumes that the default may be equal to either min or max.
# But if all are the same, something is wrong with the simulation.
assert not (
    test_res_df["Default"].equals(test_res_df["Min"])
    and test_res_df["Default"].equals(test_res_df["Max"])
), "Test results are equal. This may mean that parameters were not set at all."
status_dh.update("Finished")
```
%% Cell type:code id: tags:
``` python3
test_res_df.isna().mean()
```
%% Cell type:markdown id: tags:
## set all optimization parameters and constraints
%% Cell type:code id: tags:
``` python3
# define the parameter calibration as a minimization problem
def simulation_compare(**kwparams):
    """Runs the simulation with the specified parameters and applies the loss function to the results.
    Note that the kwparams keys need to match the index of param_spec."""
    result = sim_runner.simulation_run(**kwparams)
    result_matched = result.reindex_like(reference)
    plain_loss = loss_fn(reference, result_matched)
    # values should not be missing, penalize by increasing the loss
    missing_ratio = result_matched.isna().mean()
    if missing_ratio > 0:
        warn(f"Values are missing from result: {missing_ratio}")
    return (1 - missing_ratio) * plain_loss + missing_ratio * loss_on_error
# define constraints for the parameters; takes parameters in the same format as simulation_compare
# the nevergrad developers recommend float-valued constraints, where a value >= 0 means the constraint is satisfied
# https://facebookresearch.github.io/nevergrad/optimization.html#optimization-with-constraints
# this function takes the parameter values as they are defined by nevergrad:
# for an instrumentation this is a tuple of args and a dict of kwargs
def constraint_min_max(params_val):
    """Constraint: mindispcoef should be smaller than maxdispcoef"""
    mindispcoef = params_val[1]["msm_ADDispersion;mindispcoef;Global AD dispersion"]
    maxdispcoef = params_val[1]["msm_ADDispersion;maxdispcoef;Global AD dispersion"]
    return maxdispcoef - mindispcoef
# define optimization parameters as needed by nevergrad
ng_params = {}
# first make a dict of scalar parameters with the information from param_spec
for key in param_spec.index:
    vmin = param_spec.at[key, "Min"]
    vmax = param_spec.at[key, "Max"]
    vdef = param_spec.at[key, "Default"]
    ngp = ng.p.Scalar(init=vdef, lower=vmin, upper=vmax)
    ng_params[key] = ngp
# now turn the dict into an instrumentation
# the instrumentation tells the optimizer how the parameters are passed to our function
instrumentation = ng.p.Instrumentation(**ng_params)
instrumentation
```
%% Cell type:code id: tags:
``` python3
# register constraint
instrumentation.register_cheap_constraint(constraint_min_max)
```
%% Cell type:code id: tags:
``` python3
def optim_run(optim_name):
    # optim_dumper is used to save/load the optimization state
    state_dumper = optim_dumper.OptimDumper(temp_folder / f"dmp-{optim_name}")
    has_dump = state_dumper.get_last_step() >= 0
    if should_continue_optimization and not has_dump:
        print("No dump exists for ", optim_name)
    if not should_continue_optimization or not has_dump:
        # create optimizer
        print("starting new optimization for ", optim_name)
        optimizer = ng.optimizers.registry[optim_name](
            parametrization=instrumentation, budget=optimization_budget, num_workers=1
        )
        # tested parameters and losses are saved to lists for later evaluation.
        params = []
        losses = []
    else:
        # load the last dumped optimization state
        print("loading optimizer ", optim_name)
        optimizer, params, losses = state_dumper.load()
    # do optimization
    for step in tqdm(range(optimizer.num_tell, optimizer.budget)):
        # next parameter set to simulate
        candidate = optimizer.ask()
        loss = simulation_compare(*candidate.args, **candidate.kwargs)
        # tell the optimizer the loss value
        optimizer.tell(candidate, loss)
        # keep parameters and loss in a list
        params.append(candidate)
        losses.append(loss)
        # save optimization state to disk
        state_dumper.dump(optimizer, params, losses)
    # optimization has finished. now get the recommendation and write the result into a dataframe
    recommendation = optimizer.recommend()
    # params contains the candidates as instrumentation objects.
    # The list comprehension creates a list of dicts of parameter values,
    # which pandas then transforms into a dataframe where each parameter has its own column.
    optim_run_df = pd.concat(
        [
            pd.DataFrame(data=[p.value[1] for p in params]),
            pd.DataFrame({"Optimizer": optimizer.name, "Loss": losses}),
        ],
        axis="columns",
    )
    # also add a line for the recommendation
    rec_params = recommendation.value[1]
    optim_run_df.loc["recommendation", rec_params.keys()] = rec_params
    optim_run_df.loc["recommendation", "Loss"] = recommendation.loss
    optim_run_df.loc["recommendation", "Optimizer"] = optimizer.name
    optim_run_df.to_csv(out_folder / f"optim_run{optimizer.name}.csv")
    return optim_run_df
```
%% Cell type:markdown id: tags:
## start the optimization
%% Cell type:code id: tags:
``` python3
optim_run_dfs = {}
for o in optimizers:
    optim_run_dfs[o] = optim_run(o)
    display(optim_run_dfs[o])
    optim_run_dfs[o].plot(subplots=True, figsize=(5, 2 * len(param_spec)))
```
%% Cell type:markdown id: tags:
## save all optimization results
%% Cell type:code id: tags:
``` python3
# concatenate optim_run_dfs and output into a csv.
# optim_run_df already has optimizer included, use .values instead of passing the dict, so that the optimizer is not added again.
optim_run_df = pd.concat(optim_run_dfs.values())
optim_run_df.to_csv(out_folder / "optim_run.csv")
optim_run_df.to_excel(out_folder / "optim_run.xlsx")
```
%% Cell type:code id: tags:
``` python3
# adjust index for evaluation
optim_run_df.index.name = "Step"
optim_run_df = optim_run_df.reset_index().set_index(["Optimizer", "Step"])
```
%% Cell type:code id: tags:
``` python3
# Plot the loss for each step. This indicates how fast (and if at all) the optimizer converged to a solution.
# Note that some optimizers are not designed to converge.
optim_run_df["Loss"].unstack(level="Optimizer").plot(
subplots=True, sharey=True, title="loss over steps"
)[0].get_figure().savefig(out_folder / "loss_over_steps.png", bbox_inches="tight")
```
%% Cell type:code id: tags:
``` python3
# Shows the cumulative minimum of the loss over the steps.
# Compared to the plain loss plot, cumulative minimum filters out the effect of exploration.
# This plot can help to judge how fast the optimizer found a good solution.
optim_run_df["Loss"].unstack(level=0).cummin().plot(
title="cummin loss over steps"
).get_figure().savefig(out_folder / "loss_cummin.png", bbox_inches="tight")
```
%% Cell type:code id: tags:
``` python3
# Bar plot comparing the loss of the recommendation from each optimizer.
# This is not interesting if only one optimizer was used.
ax = optim_run_df["Loss"].unstack(level=0).loc["recommendation"].plot(kind="bar")
ax.bar_label(ax.containers[0])
optim_run_df["Loss"].xs("recommendation", level="Step")
```
%% Cell type:code id: tags:
``` python3
# runs the simulation for the recommended parameters from each optimizer.
# The results are saved as a csv together with the reference values.
# Plots are also created, which can be used to judge the quality of the simulation.
res_dict = {"ref": reference}
for key, df in optim_run_df.groupby(level="Optimizer"):
    ps = df.loc[(key, "recommendation"), param_spec.index]
    display(ps)
    test_res_def = sim_runner.simulation_run(**ps)
    res_dict[key] = test_res_def
best_res = pd.concat(res_dict, axis="columns")
best_res.to_csv(out_folder / "best_res.csv")
label_group = best_res.groupby(level="Label")
fig, axes = plt.subplots(
    nrows=len(label_group),
    figsize=(6, 2.5 * len(label_group)),
    layout="constrained",
    squeeze=False,
)
for row, (label, df) in enumerate(label_group):
    df.droplevel("Label").plot(ax=axes[row, 0], title=str(label))
fig.suptitle("Optimizer Result and Reference")
fig.savefig(out_folder / "best_res.png", bbox_inches="tight")
```
%% Cell type:code id: tags:
``` python3
# measure how much the parameters are adjusted
# first normalize to the parameter range, then take the 2-norm of the difference to the previous step
# geometric interpretation: project the parameters into the unit hypercube and calculate the velocity
# only look at params and exclude recommendation
run_param_df = optim_run_df.loc[
optim_run_df.index.get_level_values("Step") != "recommendation", param_spec.index
]
normalized = (run_param_df - param_spec["Min"]) / (
param_spec["Max"] - param_spec["Min"]
)
velocity = normalized.diff()
# norm2 (root of sum of squares)
velocity = (velocity**2).sum(axis="columns") ** 0.5
velocity.unstack(level=0).plot(title="velocity of parameter adjustments")
velocity.unstack(level=0).plot(
subplots=True, sharey=True, title="velocity of parameter adjustments"
)
velocity.unstack(level=0).plot(
subplots=True, sharey=False, title="velocity of parameter adjustments"
)
```
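%% Cell type:markdown id: tags:
In formula form, with parameter vector $p_t$ at step $t$ and the per-parameter bounds Min and Max from param_spec, the plotted velocity is
$$v_t = \left\lVert \frac{p_t - p_{t-1}}{\mathrm{Max} - \mathrm{Min}} \right\rVert_2 = \sqrt{\sum_i \left(\frac{p_{t,i} - p_{t-1,i}}{\mathrm{Max}_i - \mathrm{Min}_i}\right)^2},$$
i.e. the Euclidean step length of the normalized parameters in the unit hypercube.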
%% Cell type:markdown id: tags:
## close simulation
%% Cell type:code id: tags:
``` python3
# clean up used resources
sim_runner.close()
# delete temp folder
if DELETE_TEMP:
    shutil.rmtree(temp_folder)
# print the recommendations
for key, df in optim_run_df.groupby(level="Optimizer"):
    display(key, df.loc[(key, "recommendation"), param_spec.index])
```
%% Cell type:code id: tags:
``` python3
# for testing compare to ideal parameters
recs = {}
if hasattr(sim_runner, "ideal_params"):
    recs["ideal"] = sim_runner.ideal_params
for key, df in optim_run_df.groupby(level="Optimizer"):
    recs[key] = df.loc[(key, "recommendation"), param_spec.index]
pd.DataFrame(recs)
```
%% Cell type:code id: tags:
``` python3
```
%% Cell type:code id: tags:
``` python3
```