Variable Loss Scaling Fails With Anemoi-datasets Rename
Introduction
Anemoi is a framework for training machine learning models for meteorological forecasting. This article describes a failure of the GeneralVariableLossScaler in anemoi-training that occurs when the dataset was created with the anemoi-datasets rename function.
What Happened?
When the anemoi-datasets rename function has been used during dataset creation, self.variable_metadata_extractor.get_group_and_level(variable_name) returns a name that differs from the name the variable has in the dataset. The GeneralVariableLossScaler relies on this returned name, so the mismatch makes it fail.
To illustrate this point, consider a variable that is named 10u in the dataset after the rename function has been applied. When we call self.variable_metadata_extractor.get_group_and_level('10u'), we would expect the returned name to be 10u, the name used in the dataset. Instead, it returns ('sfc', 'U_10M', None), where U_10M appears to be the name the variable had before the rename.
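The mismatch can be illustrated with a small, self-contained sketch. It does not use the anemoi API: ToyMetadataExtractor and the layout of VARIABLES_METADATA are hypothetical stand-ins, built on the assumption that the lookup returns the name stored in the dataset metadata rather than the name the variable has in the dataset.

```python
from typing import Optional

# Hypothetical metadata, keyed by the name the variable has in the dataset.
# Assumption: the rename changed the key, but the stored "param" entry
# still holds the name the variable had before the rename.
VARIABLES_METADATA = {
    "10u": {"param": "U_10M", "levtype": "sfc"},
    "2t": {"param": "2t", "levtype": "sfc"},
}


class ToyMetadataExtractor:
    """Illustrative stand-in, not the anemoi-training implementation."""

    def __init__(self, metadata: dict) -> None:
        self.metadata = metadata

    def get_group_and_level(self, variable_name: str) -> tuple[str, str, Optional[int]]:
        meta = self.metadata[variable_name]
        # The reference name is read from the metadata, not from the key.
        return meta["levtype"], meta["param"], meta.get("level")


extractor = ToyMetadataExtractor(VARIABLES_METADATA)
result = extractor.get_group_and_level("10u")
print(result)  # ('sfc', 'U_10M', None)
```

For a variable that was never renamed, such as 2t in this toy metadata, the key and the stored name agree and the lookup returns the expected name.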
What Are the Steps to Reproduce the Bug?
To reproduce this bug, follow these steps:
- Create an anemoi-dataset using the rename function.
- Train a model using the anemoi-training command with the --config-dir and --config-name options.
- Observe the error message that indicates the GeneralVariableLossScaler has failed.
Version
The version of anemoi-training that is affected by this bug is 0.4.0.post29.
Platform (OS and Architecture)
This bug has been observed on Linux platforms.
Relevant Log Output
The relevant log output from the error message is as follows:
Traceback (most recent call last):
  File "/scratch/mch/rradev/venv/interpolator/lib/python3.12/site-packages/anemoi/training/train/train.py", line 500, in main
    AnemoiTrainer(config).train()
  File "/scratch/mch/rradev/venv/interpolator/lib/python3.12/site-packages/anemoi/training/train/train.py", line 487, in train
    self.model,
    ^^^^^^^^^^
  File "/users/rradev/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/functools.py", line 998, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/scratch/mch/rradev/venv/interpolator/lib/python3.12/site-packages/anemoi/training/train/train.py", line 198, in model
    model = model_task(**kwargs)
            ^^^^^^^^^^^^^^^^^^^^
  File "/scratch/mch/rradev/venv/interpolator/lib/python3.12/site-packages/anemoi/training/train/forecaster/interpolator.py", line 59, in __init__
    super().__init__(
  File "/scratch/mch/rradev/venv/interpolator/lib/python3.12/site-packages/anemoi/training/train/forecaster/forecaster.py", line 104, in __init__
    self.scalers, self.delayed_scaler_builders = create_scalers(
                                                 ^^^^^^^^^^^^^^^
  File "/scratch/mch/rradev/venv/interpolator/lib/python3.12/site-packages/anemoi/training/losses/scalers/scalers.py", line 36, in create_scalers
    scalers[name] = scaler_builder.get_scaling()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/mch/rradev/venv/interpolator/lib/python3.12/site-packages/anemoi/training/losses/scalers/base_scaler.py", line 81, in get_scaling
    scaler_values = self.get_scaling_values()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/mch/rradev/venv/interpolator/lib/python3.12/site-packages/anemoi/training/losses/scalers/variable.py", line 111, in get_scaling_values
    self.weights.get(
AssertionError: Variable 10u is not allowed to have a separate scaling besides U_10M.
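The assertion message suggests a consistency check between the name a variable has in the dataset and the reference name returned by the metadata lookup. The sketch below reproduces that pattern under stated assumptions: check_weights, weights, and name_to_ref are hypothetical names chosen for illustration, and the logic is a guess at the shape of the check, not the code in variable.py.

```python
def check_weights(weights: dict, name_to_ref: dict) -> None:
    """Reject a weight keyed by a dataset name that differs from its reference name.

    Hypothetical helper: it only mirrors the wording of the assertion in the log.
    """
    for variable_name, variable_ref in name_to_ref.items():
        if variable_ref != variable_name:
            assert weights.get(variable_name) is None, (
                f"Variable {variable_name} is not allowed to have "
                f"a separate scaling besides {variable_ref}."
            )


# Assumed user configuration: a weight keyed by the dataset name 10u.
weights = {"default": 1.0, "10u": 0.5}
# Assumed lookup result: 10u resolves to the reference name U_10M.
name_to_ref = {"10u": "U_10M", "2t": "2t"}

try:
    check_weights(weights, name_to_ref)
    message = None
except AssertionError as err:
    message = str(err)

print(message)
```

In this sketch the check passes once the lookup returns 10u for the variable 10u, which is the behavior the report expects.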
Accompanying Data
Unfortunately, there is no accompanying data available for this issue.
Organisation
This bug was reported by MeteoSwiss.
Frequently Asked Questions
Q: What is the cause of the GeneralVariableLossScaler failure when using an anemoi-dataset created with the rename function?
A: The failure is caused by a bug in self.variable_metadata_extractor.get_group_and_level(variable_name), which returns the wrong name for variables that have been renamed using the rename function.
Q: What is the expected behavior of the self.variable_metadata_extractor.get_group_and_level(variable_name) function?
A: The function is expected to return the name the variable has in the dataset, even if that name was assigned by the rename function. For the variable 10u, the returned name should be 10u rather than U_10M.
Q: What is the impact of the bug on the GeneralVariableLossScaler?
A: The GeneralVariableLossScaler fails when it tries to scale variables that have been renamed using the rename function, because it relies on the correct variable name to look up the scaling.
Q: How can I reproduce the bug?
A: To reproduce the bug, follow these steps:
- Create an anemoi-dataset using the rename function.
- Train a model using the anemoi-training command with the --config-dir and --config-name options.
- Observe the error message that indicates the GeneralVariableLossScaler has failed.
Q: What is the version of anemoi-training that is affected by this bug?
A: The version of anemoi-training that is affected by this bug is 0.4.0.post29.
Q: What is the platform (OS and architecture) that is affected by this bug?
A: The platform (OS and architecture) that is affected by this bug is Linux.
Q: What is the impact of this bug on the user?
A: Training with the anemoi-training command stops with an AssertionError while the model is being constructed, so an affected configuration cannot be trained until the variable names are reconciled.
Q: How can I report this bug?
A: To report this bug, please submit a ticket to the anemoi-training issue tracker with the relevant details, including the version of anemoi-training, the platform (OS and architecture), and the relevant log output from the error message.