Pixxel’s Water Quality Retrieval Algorithm
GLORIA Field Dataset for Water Quality
A key foundation of Pixxel’s water quality algorithm is GLORIA, a globally representative in situ hyperspectral dataset. GLORIA (Global Reflectance community dataset for Imaging and optical sensing of Aquatic environments) compiles 7,572 curated water reflectance spectra measured at 1 nm intervals from 350 to 900 nm, collected from inland and coastal waters worldwide. Each spectrum in GLORIA is paired with at least one co-located field measurement of water quality, providing ground truth for model training and validation. The measured parameters are chlorophyll-a (a proxy for algal biomass), total suspended solids (TSS), absorption by colored dissolved organic matter (CDOM), and water clarity metrics like Secchi depth. (Chlorophyll-a and CDOM absorption are indicators of phytoplankton density and dissolved organic content respectively, while TSS relates to turbidity.) By capturing contributions from 450 different water bodies across a range of trophic and optical conditions, GLORIA provides a robust, real-world basis to train algorithms. This field dataset spans a diversity of water types, from clear oligotrophic lakes to turbid, eutrophic estuaries, so that models learn the bio-optical variability characteristic of inland and near-shore waters. The fine spectral resolution of GLORIA’s reflectance (hyperspectral Rrs measurements) and its comprehensive documentation make it an ideal reference for remote sensing of water quality. Using GLORIA, the Pixxel algorithm obtains the ground-truth spectra and water constituent values needed to calibrate and validate predictive models. The GLORIA dataset has also been used to develop state-of-the-art global water quality models using multispectral sensors. [1,2,3,4]
SWIPE Synthetic Spectral Library for Data Augmentation
In addition to field measurements, Pixxel’s approach leverages synthetic spectral data to broaden the range of conditions the model can handle. For this, it utilizes principles from NASA’s SWIPE (Spectral Water Inversion Processor and Emulator) framework – a synthetic hyperspectral library designed for aquatic remote sensing. SWIPE is essentially a high-fidelity spectral simulator that generates realistic water reflectance spectra under diverse environmental conditions. It uses advanced radiative transfer and particle optical modeling (incorporating known optical properties of water constituents) to produce large volumes of simulated reflectance data. This includes variations in phytoplankton species and pigment composition, suspended sediment concentrations, and CDOM levels, as well as different water depths and atmospheric conditions. By leveraging recent advancements in optical modeling and big-data analytics, SWIPE creates a “synthetic training ground” for algorithm development. In practice, this means thousands of hyperspectral spectra can be synthesized for combinations of chlorophyll, TSS, CDOM, and even specific algal pigments beyond what might be available from field campaigns. For example, SWIPE’s engine can rapidly generate tens of thousands of 1-nm resolution reflectance spectra spanning 80 phytoplankton species or 16 functional groups, across the visible and near-infrared range. The ranges of water constituent concentrations are chosen to represent natural variability globally, and the corresponding reflectances are computed via physics-based models.
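To make the idea of physics-based spectral simulation concrete, the sketch below generates reflectance spectra from sampled constituent concentrations. It is not SWIPE and does not reproduce its radiative transfer or particle optics: it is a toy forward model that assumes the common approximation Rrs ≈ f · bb / (a + bb), and every coefficient, spectral shape, and concentration range in it is an illustrative placeholder. The function names `toy_rrs` and `sample_synthetic` are hypothetical.

```python
import numpy as np

# 1 nm grid over 350-900 nm, matching the spectral sampling described above.
WAVELENGTHS = np.arange(350, 901, dtype=float)


def toy_rrs(chl, tss, acdom440, wl=WAVELENGTHS):
    """Toy reflectance model: Rrs ~ 0.05 * bb / (a + bb).

    Every coefficient below is an illustrative placeholder, not a SWIPE value.
    """
    chl = np.asarray(chl, dtype=float)[:, None]            # chlorophyll-a
    tss = np.asarray(tss, dtype=float)[:, None]            # suspended solids
    acdom440 = np.asarray(acdom440, dtype=float)[:, None]  # CDOM absorption at 440 nm
    wl = np.asarray(wl, dtype=float)[None, :]

    # Absorption terms: water, phytoplankton pigments, CDOM.
    a_water = 0.005 + 0.6 * np.exp((wl - 750.0) / 60.0)    # crude rise toward the NIR
    a_phyto = chl * (
        0.04 * np.exp(-0.5 * ((wl - 440.0) / 30.0) ** 2)   # blue pigment peak
        + 0.02 * np.exp(-0.5 * ((wl - 670.0) / 12.0) ** 2) # red pigment peak
    )
    a_cdom = acdom440 * np.exp(-0.017 * (wl - 440.0))      # exponential CDOM slope

    # Backscattering terms: water and suspended particles.
    bb_water = 0.0015 * (wl / 500.0) ** -4.3
    bb_particles = 0.01 * tss * (wl / 550.0) ** -1.0

    a = a_water + a_phyto + a_cdom
    bb = bb_water + bb_particles
    return 0.05 * bb / (a + bb)


def sample_synthetic(n, seed=0):
    """Draw log-uniform concentrations and compute their toy spectra."""
    rng = np.random.default_rng(seed)
    chl = 10.0 ** rng.uniform(-1.0, 2.5, n)
    tss = 10.0 ** rng.uniform(-1.0, 2.5, n)
    acdom440 = 10.0 ** rng.uniform(-2.0, 1.0, n)
    spectra = toy_rrs(chl, tss, acdom440)
    targets = np.column_stack([chl, tss, acdom440])
    return spectra, targets


spectra, targets = sample_synthetic(1000)
print(spectra.shape, targets.shape)  # (1000, 551) (1000, 3)
```

Sampling concentrations on a log scale is a design choice made here so that clear and very turbid cases are both represented; a real simulator would draw from ranges chosen to reflect natural variability.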
In Pixxel’s algorithm, this synthetic library augments the training dataset, ensuring that the model is exposed to a wider spectrum of water quality scenarios than those captured in GLORIA alone. Extreme or under-represented cases (e.g. very high phycocyanin from intense cyanobacteria blooms, unusual optical water types, or rare combinations of low chlorophyll and high turbidity) can be included via simulation. The result is a more comprehensive training set that covers clear water to extremely turbid or algal-rich water, improving the algorithm’s generalization. By combining GLORIA’s real-world data with SWIPE’s simulated spectra, Pixxel builds a model that is robust across conditions and not biased only toward the environments measured in the field. This approach reflects a state-of-the-art methodology in remote sensing of water quality: using both empirical and synthetic data to inform algorithms. It fills data gaps and provides synthetic “ground truth” for conditions where gathering in situ data is impractical. [5,6,7,8]
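The augmentation step described above can be sketched as a simple merge of field and synthetic arrays. The source does not say how the two sets are balanced, so the cap on the synthetic share (`max_synth_ratio`) and the `is_field` flag are assumptions added for illustration, and `combine_training_sets` is a hypothetical helper name. The arrays here are random stand-ins, not GLORIA or SWIPE data.

```python
import numpy as np


def combine_training_sets(X_field, y_field, X_synth, y_synth,
                          max_synth_ratio=1.0, seed=0):
    """Stack field and synthetic samples, capping the synthetic share.

    Returns the merged features, merged targets, and a boolean flag that
    marks which rows came from field measurements.
    """
    rng = np.random.default_rng(seed)
    # Keep at most max_synth_ratio synthetic samples per field sample.
    n_keep = min(len(X_synth), int(max_synth_ratio * len(X_field)))
    idx = rng.choice(len(X_synth), size=n_keep, replace=False)

    X = np.vstack([X_field, X_synth[idx]])
    y = np.concatenate([y_field, y_synth[idx]])
    is_field = np.concatenate([
        np.ones(len(X_field), dtype=bool),
        np.zeros(n_keep, dtype=bool),
    ])
    return X, y, is_field


# Stand-in arrays: 200 "field" and 5,000 "synthetic" spectra with 50 bands.
rng = np.random.default_rng(42)
X_field, y_field = rng.random((200, 50)), rng.random(200)
X_synth, y_synth = rng.random((5000, 50)), rng.random(5000)

X_all, y_all, is_field = combine_training_sets(
    X_field, y_field, X_synth, y_synth, max_synth_ratio=3.0
)
print(X_all.shape, int(is_field.sum()))  # (800, 50) 200
```

Keeping the `is_field` flag makes it possible to report accuracy on field samples only, so that simulated spectra widen the training range without inflating validation scores.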
Boosted Extra Trees Regression Model
To translate reflectance spectra into quantitative water quality parameters, Pixxel’s algorithm employs a boosted Extra Trees regressor. Extra Trees (Extremely Randomized Trees) is an ensemble machine learning method related to Random Forests. Like a Random Forest, it builds a large collection of decision trees and averages their predictions, but Extra Trees injects additional randomness (e.g. randomizing split thresholds) to reduce overfitting and improve generalization. In effect, an Extra Trees regressor constructs numerous unpruned decision trees from the training data and makes predictions by averaging the outcomes. This ensemble-of-trees approach is well-suited for hyperspectral data: it can handle high-dimensional inputs (dozens to hundreds of spectral bands) and capture complex, nonlinear relationships between reflectance features and water constituent concentrations. By averaging many randomized trees, the model achieves greater stability and accuracy than any individual tree, often outperforming traditional regression or even standard random forests in this domain.
Boosting further enhances this model. In a boosted Extra Trees regressor, the Extra Trees ensemble serves as the base learner within a boosting framework (such as AdaBoost or a gradient-boosting variant). Boosting trains multiple rounds of the model sequentially, each round paying more attention to the errors of the previous one, thereby refining performance. By using Extra Trees as the base estimator in a boosted ensemble, Pixxel’s algorithm achieves higher predictive accuracy and robustness. Studies have found that using Extra Trees within a boosting scheme can significantly improve water quality prediction accuracy. In essence, the Pixxel model benefits from both bagging-style averaging (within the Extra Trees) and boosting (across iterations of ensembles), capturing subtle spectral signals of water quality.
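One way to assemble such a model is with scikit-learn, wrapping `ExtraTreesRegressor` as the base learner of `AdaBoostRegressor`. This is a minimal sketch under stated assumptions, not Pixxel's implementation: the source does not name the library, the boosting variant, or any settings, so the ensemble sizes and learning rate are placeholders, and the random features stand in for reflectance bands.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Stand-in data: 300 "spectra" with 20 "bands" and one nonlinear target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 20))
y = 10.0 * X[:, 0] + 5.0 * X[:, 1] ** 2 + rng.normal(0.0, 0.1, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Inner ensemble: randomized trees whose predictions are averaged.
base = ExtraTreesRegressor(n_estimators=20, random_state=0)

# Outer ensemble: each boosting round refits the base learner with more
# weight on the samples the previous rounds predicted poorly.
# The base learner is passed positionally because its keyword name
# differs between scikit-learn versions.
model = AdaBoostRegressor(base, n_estimators=5, learning_rate=0.5,
                          random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"held-out R^2: {r2_score(y_test, y_pred):.2f}")
```

`AdaBoostRegressor` predicts a single target, so under this sketch each water quality parameter would get its own fitted model.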
During training, the input to the regressor is the reflectance spectrum (either in situ Rrs from GLORIA or simulated Rrs from SWIPE, potentially after preprocessing such as normalization or feature selection), and the outputs are the estimated values of water quality parameters (chlorophyll-a, TSS, CDOM absorption, phycocyanin concentration, etc.). The model is trained and tuned on a portion of the combined dataset, with the remainder held out for validation. Hyperparameters (such as the number of trees, tree depth, and the boosting learning rate) are optimized to minimize prediction error while avoiding overfitting. The boosted Extra Trees regressor’s strength lies in its ability to learn the spectral “fingerprints” of different water constituents (for example, the chlorophyll-a absorption features around 440 nm and 670 nm, the phycocyanin feature near 620 nm, or the broad scattering-driven shape changes due to TSS) and to relate them quantitatively to concentration values. [9, 10]
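The tuning step described above can be illustrated with k-fold cross-validation over a small grid. The grid values, the fold count, the mean-absolute-error criterion, and the helper `make_model` are all assumptions chosen to keep the example fast; the source does not specify the search procedure. The data are random stand-ins for spectra and a single target.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor
from sklearn.model_selection import KFold, cross_val_score


def make_model(max_depth, learning_rate):
    """Build a small boosted Extra Trees regressor for one grid point."""
    base = ExtraTreesRegressor(n_estimators=10, max_depth=max_depth,
                               random_state=0)
    return AdaBoostRegressor(base, n_estimators=3,
                             learning_rate=learning_rate, random_state=0)


# Stand-in data: 150 "spectra" with 12 "bands".
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(150, 12))
y = 10.0 * X[:, 0] + 5.0 * X[:, 1] ** 2 + rng.normal(0.0, 0.1, 150)

cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = {}
for max_depth in (3, None):          # shallow trees vs. fully grown trees
    for learning_rate in (0.1, 1.0):
        model = make_model(max_depth, learning_rate)
        fold_scores = cross_val_score(
            model, X, y, cv=cv, scoring="neg_mean_absolute_error"
        )
        # scikit-learn reports negated errors, so flip the sign back.
        scores[(max_depth, learning_rate)] = -fold_scores.mean()

best = min(scores, key=scores.get)
print("best (max_depth, learning_rate):", best)
print("cross-validated MAE:", round(scores[best], 3))
```

Selecting on a cross-validated error rather than the training error is what guards against the overfitting mentioned above, since every score comes from samples the fitted model did not see.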
Inland/Coastal Focus and Satellite Validation
The overall algorithm is specifically tailored to inland and coastal waters, which are optically complex environments (sometimes called “Case 2” waters in remote sensing). Unlike the open ocean (Case 1) where phytoplankton largely dominate optical properties, inland and near-shore waters often have a mix of algae, sediments, and dissolved organics influencing the color of the water. Pixxel’s solution focuses on these challenging waters by training on datasets (GLORIA and SWIPE) that were explicitly designed to represent coastal and inland optical diversity. This focus ensures that the model is sensitive to signals of local eutrophication, sediment plumes, or runoff – critical for water quality monitoring of lakes, rivers, reservoirs, and estuaries.
Crucially, Pixxel validates its water quality algorithm using satellite and airborne imagery to demonstrate real-world performance. After training the boosted Extra Trees model on the combination of field and synthetic data, the team tests it on actual hyperspectral images of water bodies. For instance, the model can be applied to Pixxel’s own hyperspectral satellite data or proxy data from similar sensors, producing maps of chlorophyll, TSS, CDOM, and phycocyanin across a scene. These retrievals are then compared against known ground-truth measurements or well-studied events for validation.
The inclusion of SWIPE’s physics-based spectra in training helps the model generalize across different sensors and observation conditions. In fact, the approach has sensor-agnostic qualities: the same trained model can be used on data from different hyperspectral instruments (satellite or airborne) with consistent results. NASA’s research has shown that applying models trained on synthetic and in situ data to new-generation hyperspectral sensors yields reliable water quality products across sensors, and Pixxel follows this best practice.
During validation, the algorithm successfully reproduces spatial patterns and magnitudes of water quality parameters in satellite imagery. For example, it can identify algal bloom hotspots in a lake by elevated chlorophyll-a and phycocyanin estimates, delineate sediment-laden plumes in a river mouth via high TSS values, or map CDOM-rich, humic-stained waters in wetlands. The correspondence between the model’s outputs and validation data (from field sampling or other established products) gives confidence in its accuracy. By iteratively validating and refining the model, Pixxel ensures the algorithm meets the rigor required by technical clients, providing quantitative, reliable retrievals rather than just qualitative indicators.
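The mapping and comparison steps described above can be sketched as follows. This is a sketch under stated assumptions: the scene is taken to be a (rows, cols, bands) reflectance cube with a boolean water mask, the fitted model is any object with a scikit-learn-style `predict` method, and the error metrics are common choices rather than ones named in the source. `predict_map`, `matchup_stats`, and `MeanBandModel` are hypothetical names, and `MeanBandModel` is a trivial stand-in for a trained regressor.

```python
import numpy as np


def predict_map(model, cube, water_mask):
    """Apply a per-pixel regressor to a (rows, cols, bands) image cube.

    Non-water pixels are returned as NaN.
    """
    rows, cols, bands = cube.shape
    pixels = cube.reshape(-1, bands)
    flat_mask = np.asarray(water_mask, dtype=bool).reshape(-1)

    out = np.full(rows * cols, np.nan)
    if flat_mask.any():
        out[flat_mask] = model.predict(pixels[flat_mask])
    return out.reshape(rows, cols)


def matchup_stats(predicted, measured):
    """Summarize agreement between retrievals and field measurements."""
    err = np.asarray(predicted, dtype=float) - np.asarray(measured, dtype=float)
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "bias": float(np.mean(err)),
        "mae": float(np.mean(np.abs(err))),
    }


class MeanBandModel:
    """Stand-in for a trained regressor: returns the mean of all bands."""

    def predict(self, X):
        return np.asarray(X).mean(axis=1)


# Stand-in scene: 4 x 5 pixels, 10 bands, with the top row masked as land.
rng = np.random.default_rng(0)
cube = rng.uniform(0.0, 0.05, size=(4, 5, 10))
water_mask = np.ones((4, 5), dtype=bool)
water_mask[0, :] = False

chl_map = predict_map(MeanBandModel(), cube, water_mask)
print(chl_map.shape, int(np.isnan(chl_map).sum()))  # (4, 5) 5

# Compare retrievals at three hypothetical sampling sites with field values.
print(matchup_stats([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Flattening the cube to a pixel-by-band table lets the same model trained on individual spectra run unchanged on imagery, which is what makes the per-spectrum training approach transferable to scenes.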
Overall, Pixxel’s water quality algorithm exemplifies a rigorous, modern approach to remote sensing analytics. It combines a globally diverse field dataset (GLORIA) with a synthetic spectral library (SWIPE) to capture the full variability of inland and coastal water optics, uses a boosted ensemble regression model (Extra Trees) to robustly learn the reflectance-to-quality relationships, and is finely tuned and verified with real satellite imagery. This methodology allows technical users to trust that Pixxel’s hyperspectral imagery can be converted into actionable water quality information – from tracking nutrient-driven algal blooms (chlorophyll, phycocyanin) to monitoring turbidity and sediment flux (TSS) and dissolved organic matter – with scientific-grade confidence and detail. The result is a powerful tool for environmental monitoring agencies, researchers, and water resource managers to assess and manage water quality at scale using Pixxel’s high-resolution hyperspectral satellite constellation.
References
- https://www.nature.com/articles/s41597-023-01973-y
- https://www.sciencedirect.com/science/article/pii/S0034425719306248
- https://www.sciencedirect.com/science/article/abs/pii/S0034425720303448
- https://www.sciencedirect.com/science/article/pii/S0034425721005800
- https://ntrs.nasa.gov/citations/20220017490
- https://www.nasa.gov/nasa-earth-exchange-nex/new-missions-support/aquatic-science-in-the-hyperspectral-era
- https://www.frontiersin.org/journals/environmental-science/articles/10.3389/fenvs.2021.587660/full
- https://www.nature.com/articles/s41597-023-02310-z
- https://www.sciencedirect.com/science/article/pii/S2590123024017778