Physics-aware AutoML for Earth Observation Data

Laurens Arp

PhD project

Supervisors:

Dr. Mitra Baratchi
Prof.dr. Holger Hoos
Prof.dr. Peter van Bodegom

Contact

l.r.arp@liacs.leidenuniv.nl

Publications

Laurens Arp, Peter M. van Bodegom, Holger H. Hoos and Mitra Baratchi. Characterising the Ill-posedness of PROSAIL Inversion for Biophysical Parameter Retrieval. European Journal of Remote Sensing, 59(1), 2026.

Laurens Arp, Holger Hoos, Peter van Bodegom, Alistair Francis, James Wheeler, Dean van Laar and Mitra Baratchi. Training-free thick cloud removal for Sentinel-2 imagery using value propagation interpolation. ISPRS Journal of Photogrammetry and Remote Sensing, 216:168–184, 2024.

Laurens Arp, Mitra Baratchi and Holger Hoos. VPint: value propagation-based spatial interpolation. Data Mining and Knowledge Discovery, 36:1647–1678, 2022.

Laurens Arp, Dyon van Vreumingen, Daniela Gawehns and Mitra Baratchi. Dynamic macro scale traffic flow optimisation using crowd-sourced urban movement data. In 2020 21st IEEE International Conference on Mobile Data Management (MDM), 168–177. IEEE, 2020.

Physics-aware machine learning (PA-ML) is a broad category of approaches where machine learning is combined in some way with the rich domain knowledge on physical systems that is already available to us.

When working with Earth observation (EO) data, we are often interested in variables that represent key parameters in complex physical systems, such as climate, ocean currents and ecosystems, while we do not necessarily have access to reliable ground truth data. This is a strong motivation to incorporate physical domain knowledge into machine learning systems.

PA-ML in a nutshell

One of the most popular examples of PA-ML is the use of physics-informed neural networks (PINNs). In these networks, existing physical models are usually parameterised using a neural network. It is also possible to incorporate physical constraints (such as the conservation of mass constraint) into the training procedure of a neural network, to encourage physically consistent outputs.
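The constraint idea can be sketched as an extra penalty term in the training loss. The following is a minimal illustration, assuming a toy system in which a known inflow must equal the sum of the predicted outflows; the function names and the penalty weight are hypothetical and not taken from any particular PINN library.

```python
def mse(pred, target):
    """Ordinary data-fit term: mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mass_violation(inflow, outflows):
    """Squared imbalance between what enters and what leaves the toy system."""
    return (inflow - sum(outflows)) ** 2


def physics_aware_loss(pred, target, inflow, weight=1.0):
    """Data-fit term plus a weighted penalty for violating mass conservation."""
    return mse(pred, target) + weight * mass_violation(inflow, pred)


# Two predictions with the same data-fit error; only the second conserves mass.
target = [6.0, 4.0]
inflow = 10.0
pred_violating = [6.5, 4.5]   # sums to 11.0
pred_consistent = [6.5, 3.5]  # sums to 10.0

loss_violating = physics_aware_loss(pred_violating, target, inflow)
loss_consistent = physics_aware_loss(pred_consistent, target, inflow)
```

Both predictions are equally far from the target values, so a purely data-driven loss cannot tell them apart; the penalty term is what makes the physically consistent prediction preferable during training.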

The type of PA-ML we chiefly focus on is the combination of physical simulation models with machine learning techniques. There are various ways in which simulation models can be incorporated into machine learning pipelines. For example, we could use the simulation model to generate data. We could then train a machine learning model to “emulate” this simulator, at a fraction of the computational cost. We could also create “hybrid models” to improve extrapolation performance (a key weakness of deep learning models) by adding loss terms for non-observed data points put through the simulator, thereby preventing a model from fitting the observed data well only to break down completely outside of the training distribution. Finally, because physical models are generally founded on core principles of cause and effect, we can use machine learning to learn a mapping function from observed effects to inferred causes (“model inversion”).
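To make the emulation and inversion ideas concrete, here is a minimal sketch under strong simplifying assumptions. The `simulate` function is a hypothetical one-parameter stand-in for an expensive physical simulator, the emulator is plain linear interpolation over stored simulator runs, and the inversion is a nearest-match lookup in the same table. A real pipeline would typically replace both with trained machine learning models.

```python
import math

STEP = 0.1  # spacing of the simulated cause values (an arbitrary choice)


def simulate(cause):
    """Hypothetical stand-in for an expensive physical simulator."""
    return math.exp(-0.5 * cause)


# 1. Use the simulator to generate (cause, effect) pairs.
causes = [i * STEP for i in range(101)]  # 0.0 .. 10.0
effects = [simulate(c) for c in causes]


def emulate(cause):
    """Emulator: interpolate between stored runs instead of re-simulating."""
    i = min(int(cause / STEP), len(causes) - 2)
    frac = (cause - causes[i]) / (causes[i + 1] - causes[i])
    return effects[i] + frac * (effects[i + 1] - effects[i])


def invert(observed_effect):
    """Inversion: return the stored cause whose simulated effect is closest."""
    best = min(range(len(causes)),
               key=lambda k: abs(effects[k] - observed_effect))
    return causes[best]


emulated = emulate(3.14)              # cheap approximation of simulate(3.14)
recovered = invert(simulate(4.0))     # effect -> cause
```

The emulator maps cause to effect, like the simulator it imitates, while the inversion runs in the opposite direction. The toy simulator is monotonic, so every effect has exactly one matching cause; that is not guaranteed for real physical models.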

AutoML and PA-ML

AutoML has a number of natural interactions with PA-ML in EO problem settings. First, EO data is naturally inconsistent, with large gaps in satellite data (e.g., due to cloud cover) and ground truth data (e.g., due to spatio-temporally scattered measurements and practical limitations). The data can also be very noisy, potentially resulting in large performance drops depending on the sensitivity of the algorithms to this noise. One of the objectives in AutoML is to ensure consistency throughout the ML pipeline automatically, rather than throwing away data or coming up with quick, ad-hoc manual solutions (like mosaicking outdated pixel values onto cloudy pixels in an image). In PA-ML, in particular, physical models can be highly sensitive to such data limitations (since they were developed for idealised, ‘clean’ settings), and may not function correctly without addressing these issues.

Second, inference through model inversion is a very common task when working with physical models. As explained above, these models map from cause (like the concentration of chlorophyll in leaves, which turns them green) to effect (like the light spectra measured by satellites), but the effects are the only part that we can actually observe. However, performing model inversion can be an ill-posed problem, because multiple possible solutions (hypothesised causes) could explain our observations (observed effects) equally well. This type of problem is highly relevant to the AutoML community, which can investigate whether this ill-posedness could be reduced through, for example, automatically selecting appropriate training samples relevant to a study area.
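The non-uniqueness behind ill-posedness can be illustrated with a deliberately simple example. The forward model below is hypothetical (it is not a real radiative transfer model): only the product of two causes is observable, so many different cause pairs explain the same observation.

```python
def forward(density, thickness):
    """Hypothetical forward model: only the product of the causes is observed."""
    return density * thickness


# The "true" causes and the effect we would observe for them.
observed = forward(2.0, 3.0)

# Brute-force search over a coarse grid of candidate causes (0.5 .. 12.0).
grid = [0.5 * k for k in range(1, 25)]
solutions = [(d, t) for d in grid for t in grid
             if abs(forward(d, t) - observed) < 1e-9]
```

The list `solutions` contains the true pair `(2.0, 3.0)` alongside other pairs such as `(1.0, 6.0)`, all of which reproduce the observation exactly. Without extra information, such as prior knowledge about plausible values in a study area, there is no way to choose between them.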

Research projects

Laurens has worked on the following research projects as a part of the AutoAI for PA-ML project.

Approximating the viable solution set for model inversion problems through ε-manifolds

Currently under review; more info later!

Physical model inversion ill-posedness for vegetation parameter estimation

Estimating vegetation parameters from remotely sensed data is a tricky business. We cannot use fully data-driven approaches, because our ground truth data is insufficient in terms of volume, scope and, arguably most importantly, reliability. As a result, many vegetation parameter estimation methods rely on the inversion of a physical model called PROSAIL, which is a radiative transfer model (RTM) simulating light spectra based on input vegetation parameters. As in most model inversion problems, this problem is considered ill-posed, because there are many possible solutions for the same observed light spectrum.

However, we found that PROSAIL inversion meets all the requirements of well-posedness, and the unique solution can be found quite reliably. The difficulty is that the inversion is highly sensitive to noise (ill-conditioned). When this sensitivity is combined with the inevitable noise of EO data, the signal of the parameter–spectrum relationship is easily overwhelmed by the noise on the spectral observations, resulting in many different estimation results being possible for the same ground-level conditions. Therefore, while the PROSAIL model inversion problem is not ill-posed, the parameter estimation problem is, and moving to fully data-driven approaches that do not rely on PROSAIL would be unlikely to address this. You can read our full analysis in our European Journal of Remote Sensing paper.
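The difference between a unique solution and a stable one can be seen in a small worked example. This sketch assumes a hypothetical two-parameter, two-band linear forward model whose bands respond almost identically to both parameters; it is not PROSAIL, and the coefficients and noise level are arbitrary choices.

```python
# Band responses of the toy forward model: rows are bands, columns parameters.
A11, A12 = 1.0, 1.0
A21, A22 = 1.0, 1.001  # nearly identical to the first band


def forward(params):
    """Toy forward model mapping two parameters to two band values."""
    a, b = params
    return (A11 * a + A12 * b, A21 * a + A22 * b)


def invert(spectrum):
    """Exact inverse of the 2x2 system via Cramer's rule (unique solution)."""
    y1, y2 = spectrum
    det = A11 * A22 - A12 * A21  # small but non-zero
    a = (y1 * A22 - A12 * y2) / det
    b = (A11 * y2 - A21 * y1) / det
    return (a, b)


true_params = (2.0, 3.0)
clean = forward(true_params)
noisy = (clean[0] + 0.01, clean[1] - 0.01)  # small observation noise

estimate_clean = invert(clean)  # recovers the true parameters
estimate_noisy = invert(noisy)  # far from the true parameters
```

With clean observations the inversion recovers the true parameters, so a unique solution exists. A perturbation of only 0.01 per band, however, shifts each estimated parameter by roughly 20, which is the kind of noise amplification that ill-conditioning refers to.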

Fast, adaptable and training-free Sentinel-2 cloud removal using VPint2

Under normal circumstances, between 50% and 80% of our satellite observations cannot be used for most practical purposes, because they can only observe clouds in the atmosphere rather than the ground-level conditions we are actually interested in. Worse, these cloudy images can be spatio-temporally clustered: grey, rainy Dutch winters see a lot of clouds, while sunny Mediterranean summers produce many clean images. We could run a downstream prediction model on the most recent clear image we do have available, but if that image was obtained 5 months ago (again, rainy Dutch winter), the resulting predictions will be highly inaccurate.

With VPint2, we can use the information we do have (gaps in the clouds) to reconstruct a cloud-free image, using our previous cloud-free reference image not as a source of pixel values, but as a source of information on the spatial structure of the geographical area. The method is highly accurate (see the experimental results in our ISPRS Journal of Photogrammetry and Remote Sensing paper), can be run in two lines of code (see our blog post), and can be applied to whatever satellite platform a user is interested in. Unlike deep learning-based solutions, VPint2 does not rely on a large, representative dataset being available for exactly the same sensor, resolution and preprocessing, which makes it highly adaptable and easy to use.

Spatial interpolation using VPint

Reliable ground truth data is precious and rare in many EO settings. Even if there is a way to measure a certain variable directly (for example, measuring stations for fine dust concentrations), we usually do not have the resources to perform those measurements at every 10 m × 10 m grid cell around the world, every minute or so. Whether the data is a goal in itself or we need training data for a machine learning model, we are often interested in measurements covering a full area, while the available data consists of only a limited set of individual point measurements.

We can use spatial interpolation techniques to turn these point measurements into a full grid containing estimated values for every location on the grid. Interpolation can, however, be highly computationally intensive, while simple approaches that scale better to large datasets tend to achieve such efficiency benefits at the expense of quality. Moreover, most methods involve a tradeoff between local and global fidelity, where only one can be prioritised at a time.

In VPint, the known values in a grid are recursively propagated throughout the unknown values of the entire grid, so that a complex system of mutually interacting grid cells emerges from a simple, efficient update rule. Our experiments (which can be found in the associated Data Mining and Knowledge Discovery paper) showed that VPint achieves strong numerical performance, while scaling much better to larger datasets than common baseline methods.
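The general idea of propagating known values through a grid can be sketched in a few lines. Note that this is not VPint's actual update rule, which is described in the paper; as a simplifying assumption, the sketch substitutes plain averaging over the four direct neighbours, and the function name and iteration count are illustrative.

```python
def propagate(grid, iterations=200):
    """Fill None cells by repeatedly averaging their 4-connected neighbours.

    Simplified stand-in for a propagation-based update rule, not VPint itself.
    """
    rows, cols = len(grid), len(grid[0])
    known = [[v is not None for v in row] for row in grid]
    values = [[v if v is not None else 0.0 for v in row] for row in grid]

    for _ in range(iterations):
        updated = [row[:] for row in values]
        for r in range(rows):
            for c in range(cols):
                if known[r][c]:
                    continue  # measured cells stay fixed
                neighbours = [
                    values[nr][nc]
                    for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= nr < rows and 0 <= nc < cols
                ]
                updated[r][c] = sum(neighbours) / len(neighbours)
        values = updated
    return values


# Two point measurements in opposite corners of a 3x3 grid.
sparse = [
    [10.0, None, None],
    [None, None, None],
    [None, None, 20.0],
]
filled = propagate(sparse)
```

Each unknown cell depends on its neighbours, which in turn depend on theirs, so information from the two measured corners gradually spreads over the whole grid even though every individual update only looks at adjacent cells.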