A major task in environmental science is to obtain spatially comprehensive data from limited field samples (e.g. climate stations, soil profiles, vegetation records). This is often done using machine learning algorithms that learn the relationships between field data and remotely sensed predictor variables (e.g. from satellites). The trained model is then used to make spatial predictions for the entire area of interest (i.e. to create a "map" of the variable of interest).
Such a map is only valuable when the error of the model is known. Error assessment, however, is difficult when the data are spatially dependent, because such dependencies cause standard validation methods to fail. There is growing consensus in the literature that spatial validation is necessary, and several strategies have been proposed (e.g. Roberts 2017, Meyer 2018, Valavi 2018, Brenning 2012). However, little research has been done on how the different strategies compare, although this is important for model comparisons and for finding the "right" validation strategy for a dataset.
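The difference between a standard and a spatial validation split can be sketched as follows. This is a minimal illustration in plain Python, not the method of any of the cited papers or R packages: the function names, the square blocks and the `block_size` parameter are assumptions made for the example. Random k-fold assigns every sample to a fold independently of its location, whereas a block-based split keeps all samples of one spatial block in the same fold, so that test samples are spatially separated from training samples.

```python
import random


def random_folds(n_points, k, seed=0):
    """Standard k-fold: assign samples to k balanced folds, ignoring location."""
    rng = random.Random(seed)
    folds = [i % k for i in range(n_points)]
    rng.shuffle(folds)
    return folds


def block_folds(coords, block_size, k, seed=0):
    """Spatial k-fold: all samples falling into the same square block share a fold.

    coords     -- list of (x, y) sample locations (hypothetical units, e.g. km)
    block_size -- edge length of a square block, in the same units (assumption)
    """
    # Index of the block each sample falls into.
    blocks = [(int(x // block_size), int(y // block_size)) for x, y in coords]
    # Shuffle the distinct blocks, then deal them out to the k folds in turn.
    unique_blocks = sorted(set(blocks))
    random.Random(seed).shuffle(unique_blocks)
    fold_of_block = {b: i % k for i, b in enumerate(unique_blocks)}
    return [fold_of_block[b] for b in blocks]


# Illustrative sample locations (made up for the example).
coords = [(0.5, 0.5), (0.7, 0.2), (5.1, 5.3), (5.9, 5.2), (9.5, 0.3)]
spatial = block_folds(coords, block_size=5, k=3)
standard = random_folds(len(coords), k=3)
```

With clustered samples, the random split will usually place close neighbours in different folds, while the block split holds out whole neighbourhoods at once; this is the contrast the project sets out to quantify.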
This project aims at comparing and (as far as possible) evaluating different validation strategies for machine learning based spatial mapping of environmental variables. In most projects the "true" performance is hard to assess because the reference data are limited. We therefore want to address the problem using reference data that are available in a spatially continuous way. Such data are rare, but a suitable case is available in the field of rainfall monitoring: the RADOLAN dataset contains high-quality rainfall data from a radar network at 1 km spatial resolution and serves as an excellent reference set. We then want to model rainfall for Germany using data from the geostationary satellite sensor MSG SEVIRI and auxiliary predictor variables such as elevation (see Kühnlein 2014 for the idea of modelling rainfall based on MSG SEVIRI).
Using the RADOLAN data, we simulate different sampling designs (i.e. we simulate rain gauges in different configurations: random, clustered, etc.) and train machine learning models to predict rainfall from MSG SEVIRI in a spatial way. Different strategies for spatial model evaluation are then compared.
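Simulating rain gauges on a gridded reference could look roughly like the sketch below. It is written in plain Python under stated assumptions: the reference raster is represented by a small synthetic nested list rather than real RADOLAN data, and the function names, the cluster radius and the number of gauges are hypothetical choices for illustration only.

```python
import random


def random_design(n_rows, n_cols, n_samples, seed=0):
    """Draw n_samples distinct grid cells uniformly at random."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    return rng.sample(cells, n_samples)


def clustered_design(n_rows, n_cols, n_clusters, per_cluster, radius, seed=0):
    """Draw random cluster centres, then sample cells within `radius` cells of each."""
    rng = random.Random(seed)
    chosen = set()
    for _ in range(n_clusters):
        cr, cc = rng.randrange(n_rows), rng.randrange(n_cols)
        # Candidate cells: square window around the centre, clipped to the grid.
        window = [
            (r, c)
            for r in range(max(0, cr - radius), min(n_rows, cr + radius + 1))
            for c in range(max(0, cc - radius), min(n_cols, cc + radius + 1))
            if (r, c) not in chosen
        ]
        chosen.update(rng.sample(window, min(per_cluster, len(window))))
    return sorted(chosen)


def extract(grid, cells):
    """Read the reference value at each simulated gauge location."""
    return [grid[r][c] for r, c in cells]


# Synthetic placeholder raster (NOT real rainfall data): value = row + column.
grid = [[r + c for c in range(20)] for r in range(20)]

gauges_random = random_design(20, 20, n_samples=30)
gauges_clustered = clustered_design(20, 20, n_clusters=3, per_cluster=10, radius=3)
rain_random = extract(grid, gauges_random)
rain_clustered = extract(grid, gauges_clustered)
```

Because the reference is spatially continuous, every cell that was not sampled remains available as independent test data, which is what allows the error estimates of the different validation strategies to be checked against the error on the full map.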
Requirements:
R programming skills are an advantage; an interest in working with machine learning algorithms is required.
References:
Brenning, A. (2012): Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest. IEEE International Geoscience and Remote Sensing Symposium.
Kühnlein, M., Appelhans, T., Thies, B., Nauss, T. (2014): Precipitation Estimates from MSG SEVIRI Daytime, Nighttime, and Twilight Data with Random Forests. Journal of Applied Meteorology and Climatology 53 (11): 2457–2480.
Meyer, H., Reudenbach, C., Hengl, T., Katurji, M., Nauss, T. (2018): Improving performance of spatio-temporal machine learning models using forward feature selection and target-oriented validation. Environmental Modelling & Software 101: 1-9.
Roberts, D. R., Bahn, V., Ciuti, S., Boyce, M. S., Elith, J., Guillera-Arroita, G., Hauenstein, S., Lahoz-Monfort, J. J., Schröder, B., Thuiller, W., Warton, D. I., Wintle, B. A., Hartig, F., Dormann, C. F. (2017): Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, doi:10.1111/ecog.02881.
Valavi, R., Elith, J., Lahoz-Monfort, J. J., Guillera-Arroita, G. (2018): blockCV: An R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods in Ecology and Evolution 00: 1–8. https://doi.org/10.1111/2041-210X.13107
Author: Marc Dragunski
Supervisor: Hanna Meyer