Pixxel’s Water Quality Retrieval Algorithm
GLORIA Field Dataset for Water Quality
A key foundation of Pixxel’s water quality algorithm is GLORIA, a globally representative in situ hyperspectral dataset. GLORIA (Global Reflectance community dataset for Imaging and optical sensing of Aquatic environments) compiles 7,572 curated remote sensing reflectance (Rrs) spectra measured at 1 nm intervals from 350–900 nm, collected from inland and coastal waters worldwide. Each spectrum is paired with field-based water quality measurements, providing ground truth for model training and validation. The co-located measurements include chlorophyll-a (a proxy for algal biomass and phytoplankton density), total suspended solids (TSS, which relates to turbidity), absorption by colored dissolved organic matter (CDOM, an indicator of dissolved organic content), and water clarity metrics such as Secchi depth. Because it draws on 450 different water bodies across a range of trophic and optical conditions, from clear oligotrophic lakes to turbid, eutrophic estuaries, GLORIA lets models learn the bio-optical variability characteristic of inland and near-shore waters. Its high spectral resolution and comprehensive documentation make it a strong reference for remote sensing of water quality, and it supplies the ground-truth spectra and constituent values that the Pixxel algorithm needs for calibration and validation. GLORIA has also been used to develop state-of-the-art global water quality models for multispectral sensors. [1,2,3,4]
SWIPE Synthetic Spectral Library for Data Augmentation
In addition to field measurements, Pixxel’s approach uses synthetic spectral data to broaden the range of conditions the model can handle. It draws on principles from NASA’s SWIPE (Spectral Water Inversion Processor and Emulator) framework, a synthetic hyperspectral library designed for aquatic remote sensing. SWIPE is a high-fidelity spectral simulator that generates realistic water reflectance spectra under diverse environmental conditions. It uses radiative transfer and particle optical modeling, incorporating known optical properties of water constituents, to produce large volumes of simulated reflectance data. The simulations vary phytoplankton species and pigment composition, suspended sediment concentrations, and CDOM levels, as well as water depth and atmospheric conditions. Building on recent advances in optical modeling and big-data analytics, SWIPE creates a “synthetic training ground” for algorithm development. In practice, thousands of hyperspectral spectra can be synthesized for combinations of chlorophyll, TSS, CDOM, and specific algal pigments that go beyond what field campaigns provide. For example, SWIPE’s engine can rapidly generate tens of thousands of 1 nm resolution reflectance spectra spanning 80 phytoplankton species or 16 functional groups across the visible and near-infrared range. The ranges of constituent concentrations are chosen to represent natural variability globally, and the corresponding reflectances are computed with physics-based models.
In Pixxel’s algorithm, this synthetic library augments the training dataset so that the model is exposed to a wider range of water quality scenarios than GLORIA alone captures. Extreme or under-represented cases (e.g. very high phycocyanin from intense cyanobacteria blooms, unusual optical water types, or rare combinations of low chlorophyll and high turbidity) can be included through simulation. The result is a more comprehensive training set that covers clear water through to extremely turbid or algal-rich water, improving the algorithm’s generalization. By combining GLORIA’s real-world data with SWIPE’s simulated spectra, Pixxel builds a model that is robust across conditions and not biased toward only the environments measured in the field. This reflects a state-of-the-art methodology in remote sensing of water quality: using both empirical and synthetic data to inform algorithms. The synthetic data fill gaps and provide synthetic “ground truth” for conditions where gathering in situ data is impractical. [5,6,7,8]
Optically Active Parameter Model
Retrieving optically active water quality parameters, such as chlorophyll-a (Chl-a), total suspended solids (TSS), and the absorption coefficient of colored dissolved organic matter at 440 nm (aCDOM(440)), from satellite remote sensing data relies on direct physical interactions between light and water constituents. These interactions produce distinct spectral signatures, such as the characteristic green reflectance peak of Chl-a or the high backscattering of TSS in the red and near-infrared (NIR) regions, which can be readily detected by space-borne optical sensors. However, accurate retrieval remains challenging because the spectral influences of these constituents overlap, particularly in optically complex Case 2 waters where variations in TSS and CDOM can mask or mimic Chl-a signals. Robust estimation therefore requires modelling approaches capable of decoupling these highly nonlinear spectral interdependencies.
The present work uses a deep learning–based framework to retrieve key optical water quality parameters directly from surface reflectance data. The approach uses a Mixture Density Network (MDN) to model the nonlinear and probabilistic relationships between satellite observations and water quality dynamics. By mapping multi-band spectral reflectance directly to constituent concentrations, the framework avoids the limitations of rigid empirical and semi-analytical algorithms and captures the variability of diverse aquatic environments.
The training pipeline integrates in situ radiometric and biogeochemical measurements from open-source global datasets, which capture the natural bio-geo-optical diversity of coastal and inland water bodies worldwide. To bridge spatiotemporal data gaps and prevent the model from overfitting to specific geographic regions, this in situ dataset is supplemented with extensive synthetic data generated using Radiative Transfer Modelling (RTM), including SWIPE and HydroLight. The RTM simulations systematically vary constituent concentrations and viewing geometries and map them to the corresponding simulated remote sensing reflectance (Rrs) spectra. Hyperspectral spectra from both the in situ and the RTM data are then resampled with each sensor’s spectral response function (SRF) to match the visible and near-infrared (VNIR) band configurations of the Sentinel-2 and Firefly satellites.
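The SRF resampling step can be illustrated with a minimal sketch. It assumes a band value is the SRF-weighted mean of a spectrum sampled on a uniform 1 nm grid. The Gaussian response shape, the band names, and the centre and width values are placeholders chosen for illustration, not the actual Sentinel-2 or Firefly response functions, which would normally be read from sensor-provided tables.

```python
import math


def gaussian_srf(wavelengths, center, fwhm):
    """Hypothetical Gaussian response curve; real SRFs are tabulated per sensor."""
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    return [math.exp(-0.5 * ((w - center) / sigma) ** 2) for w in wavelengths]


def band_average(rrs, srf):
    """SRF-weighted mean of a spectrum sampled on a uniform wavelength grid."""
    if len(rrs) != len(srf):
        raise ValueError("spectrum and SRF must share the same wavelength grid")
    total_weight = sum(srf)
    if total_weight == 0:
        raise ValueError("SRF is zero everywhere on this grid")
    return sum(r * s for r, s in zip(rrs, srf)) / total_weight


def resample(wavelengths, rrs, bands):
    """Convolve one hyperspectral spectrum to a set of (center, fwhm) bands."""
    return {
        name: band_average(rrs, gaussian_srf(wavelengths, center, fwhm))
        for name, (center, fwhm) in bands.items()
    }


# Toy 1 nm spectrum: flat baseline plus a green reflectance peak near 560 nm.
wavelengths = list(range(400, 901))
rrs = [0.002 + 0.004 * math.exp(-0.5 * ((w - 560) / 30.0) ** 2) for w in wavelengths]

# Illustrative band definitions (center nm, FWHM nm); assumed values only.
bands = {"blue": (490, 65), "green": (560, 35), "red": (665, 30), "nir": (842, 100)}

band_rrs = resample(wavelengths, rrs, bands)
for name, value in band_rrs.items():
    print(f"{name}: {value:.5f}")
```

Normalising by the summed response keeps a spectrally flat input unchanged, so the same routine can be reused for any sensor by swapping in its tabulated response curves.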
Unlike deterministic regression methods or static bio-optical inversions, the present deep-learning framework estimates conditional probability distributions of target variables. This allows the model to capture the inherent measurement uncertainties and multimodal behaviour commonly observed in complex, highly turbid, or eutrophic aquatic systems.
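To make the idea of a conditional probability distribution concrete, the sketch below shows how the output of a mixture density network for a single pixel might be interpreted. It assumes a one-dimensional Gaussian mixture over a log-transformed target. The logits, means, and standard deviations are invented numbers standing in for network outputs, and the function names are hypothetical rather than taken from Pixxel’s implementation.

```python
import math


def softmax(logits):
    """Turn unnormalised scores into mixture weights that sum to one."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_pdf(y, weights, means, sigmas):
    """Density of a one-dimensional Gaussian mixture evaluated at y."""
    return sum(
        w * math.exp(-0.5 * ((y - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, mu, s in zip(weights, means, sigmas)
    )


def negative_log_likelihood(y, weights, means, sigmas):
    """Per-sample training loss commonly used for mixture density networks."""
    return -math.log(mixture_pdf(y, weights, means, sigmas))


def mixture_mean(weights, means):
    """Expected value of the mixture."""
    return sum(w * mu for w, mu in zip(weights, means))


def mixture_variance(weights, means, sigmas):
    """Total variance: within-component spread plus between-component spread."""
    center = mixture_mean(weights, means)
    return sum(w * (s ** 2 + (mu - center) ** 2) for w, mu, s in zip(weights, means, sigmas))


def dominant_mean(weights, means):
    """Mean of the highest-weight component, one common point estimate."""
    k = max(range(len(weights)), key=lambda i: weights[i])
    return means[k]


# Invented network outputs for one pixel (target assumed to be log10 of Chl-a).
weights = softmax([2.0, 0.5, -1.0])
means = [0.3, 1.2, 2.0]
sigmas = [0.15, 0.30, 0.50]

point_estimate = dominant_mean(weights, means)
spread = math.sqrt(mixture_variance(weights, means, sigmas))
print("weights:", [round(w, 3) for w in weights])
print("point estimate (log10):", point_estimate)
print("mixture std (log10):", round(spread, 3))
```

A deterministic regressor would return only a single value; here the mixture weights and variance also indicate when a spectrum is consistent with more than one plausible concentration.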
Non-Optically Active Parameter Model
Retrieving non-optically active water quality parameters such as dissolved oxygen (DO), dissolved ammonia (NH₃), pH, and water temperature from satellite remote sensing data is considerably more challenging than estimating optically active constituents like Chl-a, TSS, and CDOM. Optical parameters directly influence the absorption and scattering characteristics of water, producing distinct spectral signatures that can be readily detected by space-borne optical sensors. In contrast, non-optical parameters have no direct spectral response in satellite imagery and are only indirectly linked to it through environmental and biogeochemical processes. Their variability is driven by complex interactions among meteorological conditions, hydrodynamics, nutrient cycling, biological activity, and seasonal changes. Retrieving these parameters therefore requires large, globally distributed training datasets and modelling approaches capable of learning nonlinear and hidden relationships among spectral reflectance, auxiliary environmental variables, and in situ observations.
The present work uses a deep learning–based framework to retrieve these non-optical water quality parameters, combining multispectral reflectance from Sentinel-2 with meteorological variables. As for the optically active parameters, a Mixture Density Network (MDN) models the nonlinear and probabilistic relationships between satellite observations and water quality dynamics.
Surface reflectance in the visible and near-infrared (VNIR) bands of Sentinel-2 imagery was matched with synchronous global in situ water quality measurements (N ≈ 90,000) collected across multiple monitoring stations. Meteorological variables, including air temperature, precipitation, wind speed, humidity, and solar radiation, were integrated to account for atmospheric and hydrological influences on water chemistry. These heterogeneous datasets were temporally aligned and used as input features for training the deep learning model. Unlike deterministic regression methods, the framework estimates conditional probability distributions of the target variables, thereby capturing the uncertainty and multimodal behaviour commonly observed in environmental systems.
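The temporal alignment described above can be sketched as a simple matchup routine. The record layout, the station identifiers, the three-hour matching window, and every numeric value below are assumptions made for illustration; the text does not specify the actual matchup criteria.

```python
from datetime import date, datetime, timedelta


def nearest_within(target_time, candidates, max_gap):
    """Return the candidate closest in time to target_time, or None if too far away."""
    best = min(candidates, key=lambda c: abs(c["time"] - target_time), default=None)
    if best is None or abs(best["time"] - target_time) > max_gap:
        return None
    return best


def build_matchups(in_situ, satellite, weather, max_gap=timedelta(hours=3)):
    """Pair each field sample with a same-station scene and same-day weather record."""
    rows = []
    for sample in in_situ:
        same_station = [s for s in satellite if s["station"] == sample["station"]]
        scene = nearest_within(sample["time"], same_station, max_gap)
        met = weather.get((sample["station"], sample["time"].date()))
        if scene is None or met is None:
            continue  # skip samples without a usable scene or weather record
        rows.append({"features": scene["reflectance"] + met, "target": sample["value"]})
    return rows


# Toy records; stations, times, and values are invented for illustration.
in_situ = [
    {"station": "A", "time": datetime(2023, 6, 1, 10, 30), "value": 7.8},
    {"station": "A", "time": datetime(2023, 6, 5, 9, 0), "value": 6.9},
    {"station": "B", "time": datetime(2023, 6, 1, 11, 0), "value": 8.4},
]
satellite = [
    {"station": "A", "time": datetime(2023, 6, 1, 10, 50), "reflectance": [0.031, 0.045, 0.028, 0.012]},
    {"station": "A", "time": datetime(2023, 6, 6, 10, 50), "reflectance": [0.029, 0.041, 0.030, 0.015]},
    {"station": "B", "time": datetime(2023, 6, 1, 10, 50), "reflectance": [0.022, 0.038, 0.019, 0.008]},
]
# Daily weather keyed by (station, date): air temperature, precipitation, wind speed.
weather = {
    ("A", date(2023, 6, 1)): [24.1, 0.0, 3.2],
    ("B", date(2023, 6, 1)): [22.7, 1.4, 5.0],
}

rows = build_matchups(in_situ, satellite, weather)
for row in rows:
    print(row)
```

In this toy case the second sample is dropped because the nearest overpass for its station is more than a day away, which is how a matching window keeps the training pairs close to synchronous.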
The developed model demonstrated strong predictive capability for all four water quality parameters. Temperature retrieval achieved the lowest error due to its relatively stable spectral and meteorological relationships, while ammonia exhibited greater variability associated with biological and hydrodynamic processes. The inclusion of meteorological predictors significantly improved model performance compared to satellite-reflectance-only approaches, particularly during seasonal transitions and extreme weather conditions. Results indicate that the present deep-learning model effectively captured complex nonlinear interactions and reduced prediction uncertainty relative to conventional machine learning models such as Random Forest and standard Artificial Neural Networks.
Inland/Coastal Focus and Satellite Validation
The overall algorithm is specifically tailored to inland and coastal waters, which are optically complex environments (sometimes called “Case 2” waters in remote sensing). Unlike the open ocean (Case 1) where phytoplankton largely dominate optical properties, inland and near-shore waters often have a mix of algae, sediments, and dissolved organics influencing the color of the water. Pixxel’s solution focuses on these challenging waters by training on datasets (GLORIA and SWIPE) that were explicitly designed to represent coastal and inland optical diversity. This focus ensures that the model is sensitive to signals of local eutrophication, sediment plumes, or runoff – critical for water quality monitoring of lakes, rivers, reservoirs, and estuaries.
Crucially, Pixxel validates its water quality algorithm on satellite and airborne imagery to demonstrate real-world performance. After training the Mixture Density Network (MDN) model on the combination of field and synthetic data, the team tests it on actual hyperspectral images of water bodies. For instance, the model can be applied to Pixxel’s own hyperspectral satellite data or to proxy data from similar sensors, producing maps of chlorophyll, TSS, CDOM, and phycocyanin across a scene. These retrievals are then compared against known ground-truth measurements or well-studied events. Including SWIPE’s physics-based spectra in training helps the model generalize across sensors and observation conditions, which gives the approach sensor-agnostic qualities: the same trained model can be used on data from different hyperspectral instruments (satellite or airborne) with consistent results. NASA’s research has shown that applying models trained on synthetic and in situ data to new-generation hyperspectral sensors yields reliable water quality products across sensors, and Pixxel follows this practice. During validation, the algorithm reproduces the spatial patterns and magnitudes of water quality parameters in satellite imagery. For example, it can identify algal bloom hot spots in a lake from elevated chlorophyll-a and phycocyanin estimates, delineate sediment-laden plumes at a river mouth from high TSS values, or map CDOM-rich, humic-stained waters in wetlands. The correspondence between the model’s outputs and validation data (from field sampling or other established products) gives confidence in its accuracy. By iteratively validating and refining the model, Pixxel ensures the algorithm meets the rigor required by technical clients, providing quantitative, reliable retrievals rather than only qualitative indicators.
Overall, Pixxel’s water quality algorithm exemplifies a rigorous, modern approach to remote sensing analytics. It combines a globally diverse field dataset (GLORIA) with a synthetic spectral library (SWIPE) to capture the variability of inland and coastal water optics, uses Mixture Density Networks (MDNs) to learn probabilistic reflectance-to-quality relationships, and is tuned and verified with real satellite imagery. This methodology allows technical users to trust that Pixxel’s hyperspectral imagery can be converted into actionable water quality information, from tracking nutrient-driven algal blooms (chlorophyll, phycocyanin) to monitoring turbidity and sediment flux (TSS) and dissolved organic matter, with scientific-grade confidence and detail. The result is a powerful tool for environmental monitoring agencies, researchers, and water resource managers to assess and manage water quality at scale using Pixxel’s high-resolution hyperspectral satellite constellation.
References
[1] https://www.nature.com/articles/s41597-023-01973-y
[2] https://www.sciencedirect.com/science/article/pii/S0034425719306248
[3] https://www.sciencedirect.com/science/article/abs/pii/S0034425720303448
[4] https://www.sciencedirect.com/science/article/pii/S0034425721005800
[5] https://ntrs.nasa.gov/citations/20220017490
[6] https://www.nasa.gov/nasa-earth-exchange-nex/new-missions-support/aquatic-science-in-the-hyperspectral-era
[7] https://www.frontiersin.org/journals/environmental-science/articles/10.3389/fenvs.2021.587660/full
[8] https://www.nature.com/articles/s41597-023-02310-z
[9] https://www.sciencedirect.com/science/article/pii/S2590123024017778