• Water quality refers to the chemical, physical, and biological characteristics of water. It matters for the environment, public health, and the many industries that depend on water. Water quality is assessed with mathematical or computational models that predict the behaviour of specific water quality parameters in a water body and determine whether the water is suitable for uses such as drinking, agriculture, industrial processes, and supporting aquatic ecosystems.

  • Notable water quality parameters include pH, Dissolved Oxygen (DO), nutrients (nitrogen and phosphorus), Biological Oxygen Demand (BOD), Chemical Oxygen Demand (COD), heavy metals, coliform bacteria, chlorophyll-a concentration (Chl-a), the absorption coefficient of Colored Dissolved Organic Matter (aCDOM), and Total Suspended Solids (TSS).

  • Only optically active parameters, namely Chl-a (in mg m⁻³), TSS (in g m⁻³), and aCDOM(440) (in m⁻¹), are valid inputs to the present Water Quality Index model. They are estimated directly from satellite-derived Level-2 water reflectance images by applying suitable bio-optical or machine learning algorithms. These three parameters are then used to calculate the Trophic State Index (TSI), TSS Index (TI), and CDOM Index (CI). Finally, TSI, TI, and CI are supplied as inputs to the Fuzzy Inference System (FIS) model, which generates the Water Quality Index (WQI) raster image.

  • The calculated TSI, TI, CI, and WQI rasters take unitless values between 0 and 100. The table below shows how to interpret each index when analysing water quality.

| TSI | Water type (based on TSI range) | TI | Water type (based on TI range) | CI | Water type (based on CI range) | WQI | Water quality (based on WQI) |
|---|---|---|---|---|---|---|---|
| 0–30 | Ultra-Oligotrophic | 0–40 | Low turbid | 0–41 | Low CDOM | 80–100 | Good |
| 30–40 | Oligotrophic | 40–54 | Moderate turbid | 41–47 | Moderate CDOM | 60–80 | Average |
| 40–50 | Mesotrophic | 54–60 | High turbid | 47–77 | High CDOM | 40–60 | Poor |
| 50–60 | Eutrophic | 60–100 | Very high turbid | 77–100 | Very high CDOM | 20–40 | Very Poor |
| 60–100 | Hypertrophic | | | | | 0–20 | Extremely Poor |
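The lookup described by the table above can be sketched in a few lines of Python. This is an illustration only: the class boundaries are transcribed from the table, but the source does not say which class a value exactly on a boundary (for example TSI = 30) belongs to, so assigning it to the higher range is an assumption, as are the names `CLASSES` and `classify`.

```python
from bisect import bisect_right

# Class boundaries and labels transcribed from the interpretation table.
# Assumption: a value exactly on a boundary falls into the higher range.
CLASSES = {
    "TSI": ([30, 40, 50, 60],
            ["Ultra-Oligotrophic", "Oligotrophic", "Mesotrophic",
             "Eutrophic", "Hypertrophic"]),
    "TI": ([40, 54, 60],
           ["Low turbid", "Moderate turbid", "High turbid", "Very high turbid"]),
    "CI": ([41, 47, 77],
           ["Low CDOM", "Moderate CDOM", "High CDOM", "Very high CDOM"]),
    "WQI": ([20, 40, 60, 80],
            ["Extremely Poor", "Very Poor", "Poor", "Average", "Good"]),
}


def classify(index_name: str, value: float) -> str:
    """Map an index value in [0, 100] to its class label (hypothetical helper)."""
    if not 0 <= value <= 100:
        raise ValueError("index values are expected to lie between 0 and 100")
    bounds, labels = CLASSES[index_name]
    # bisect_right counts how many boundaries are <= value, which is the class position.
    return labels[bisect_right(bounds, value)]


print(classify("TSI", 25), "|", classify("WQI", 85))
```

Applied per pixel, a helper like this would turn the four index rasters into categorical maps; note that WQI runs in the opposite direction to the other three indexes, with higher values meaning better water quality.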

Pixxel’s Water Quality Retrieval Algorithm

GLORIA Field Dataset for water quality

A key foundation of Pixxel’s water quality algorithm is GLORIA, a globally representative in situ hyperspectral dataset. GLORIA (Global Reflectance community dataset for Imaging and optical sensing of Aquatic environments) compiles 7,572 curated water reflectance spectra measured at 1 nm intervals from 350–900 nm, collected from inland and coastal waters worldwide. Each spectrum in GLORIA is paired with field-based water quality measurements, providing ground truth for model training and validation. Notably, the dataset includes co-located values for chlorophyll-a (a proxy for algal biomass and phytoplankton density), total suspended solids (TSS, which relates to turbidity), absorption by colored dissolved organic matter (CDOM, an indicator of dissolved organic content), and water clarity metrics such as Secchi depth. With contributions from 450 different water bodies across a range of trophic and optical conditions, GLORIA provides a robust, real-world basis for training algorithms. The dataset spans a diversity of water types, from clear oligotrophic lakes to turbid, eutrophic estuaries, so that models learn the bio-optical variability characteristic of inland and near-shore waters. The high spectral resolution of GLORIA’s hyperspectral remote sensing reflectance (Rrs) measurements and its comprehensive documentation make it a strong reference for remote sensing of water quality. GLORIA supplies the ground-truth spectra and water constituent values that the Pixxel algorithm needs to calibrate and validate its predictive models. The dataset has also been used to develop state-of-the-art global water quality models for multispectral sensors. [1,2,3,4]

SWIPE Synthetic Spectral Library for Data Augmentation

In addition to field measurements, Pixxel’s approach leverages synthetic spectral data to broaden the range of conditions the model can handle. For this, it utilizes principles from NASA’s SWIPE (Spectral Water Inversion Processor and Emulator) framework – a synthetic hyperspectral library designed for aquatic remote sensing. SWIPE is essentially a high-fidelity spectral simulator that generates realistic water reflectance spectra under diverse environmental conditions. It uses advanced radiative transfer and particle optical modeling (incorporating known optical properties of water constituents) to produce large volumes of simulated reflectance data. This includes variations in phytoplankton species and pigment composition, suspended sediment concentrations, and CDOM levels, as well as different water depths and atmospheric conditions. By leveraging recent advancements in optical modeling and big-data analytics, SWIPE creates a “synthetic training ground” for algorithm development. In practice, this means thousands of hyperspectral spectra can be synthesized for combinations of chlorophyll, TSS, CDOM, and even specific algal pigments beyond what might be available from field campaigns. For example, SWIPE’s engine can rapidly generate tens of thousands of 1-nm resolution reflectance spectra spanning 80 phytoplankton species or 16 functional groups, across the visible and near-infrared range. The ranges of water constituent concentrations are chosen to represent natural variability globally, and the corresponding reflectances are computed via physics-based models.

In Pixxel’s algorithm, this synthetic library augments the training dataset, exposing the model to a wider spectrum of water quality scenarios than GLORIA alone captures. Extreme or under-represented cases (e.g. very high phycocyanin from intense cyanobacteria blooms, unusual optical water types, or rare combinations of low chlorophyll and high turbidity) can be included via simulation. The result is a more comprehensive training set, covering clear water through to extremely turbid or algal-rich water, which improves the algorithm’s generalization. By combining GLORIA’s real-world data with SWIPE’s simulated spectra, Pixxel builds a model that is robust across conditions and not biased toward only the environments measured in the field. Using both empirical and synthetic data to inform algorithms reflects state-of-the-art methodology in remote sensing of water quality: it fills data gaps and provides synthetic “ground truth” for conditions where gathering in situ data is impractical. [5,6,7,8]

Boosted Extra Trees Regression Model

To translate reflectance spectra into quantitative water quality parameters, Pixxel’s algorithm employs a boosted Extra Trees regressor. Extra Trees (Extremely Randomized Trees) is an ensemble machine learning method related to Random Forests. Like a Random Forest, it builds a large collection of decision trees and averages their predictions, but Extra Trees injects additional randomness (e.g. randomizing split thresholds) to reduce overfitting and improve generalization. In effect, an Extra Trees regressor constructs numerous unpruned decision trees from the training data and makes predictions by averaging their outputs. This ensemble-of-trees approach is well suited to hyperspectral data: it can handle high-dimensional inputs (dozens to hundreds of spectral bands) and capture complex, nonlinear relationships between reflectance features and water constituent concentrations. By averaging many randomized trees, the model achieves greater stability and accuracy than any individual tree, often outperforming traditional regression or even standard random forests in this domain.

Boosting further enhances this model. In a boosted Extra Trees regressor, the Extra Trees ensemble serves as the base learner within a boosting framework (such as AdaBoost or a gradient-boosting variant). Boosting trains multiple rounds of the model sequentially, each round paying more attention to the errors of the previous one, thereby refining performance. By using Extra Trees as the base estimator in a boosted ensemble, Pixxel’s algorithm achieves higher predictive accuracy and robustness. Studies have found that using Extra Trees within a boosting scheme can significantly improve water quality prediction accuracy. In essence, the Pixxel model benefits from both bagging-style averaging (within each Extra Trees ensemble) and boosting (across successive ensembles), capturing subtle spectral signals of water quality.

During training, the input to the regressor is the reflectance spectrum (either in situ Rrs from GLORIA or simulated Rrs from SWIPE, potentially after preprocessing such as normalization or feature selection), and the outputs are the estimated values of water quality parameters (chlorophyll-a, TSS, CDOM absorption, phycocyanin concentration, and so on). The model is trained and tuned on a portion of the combined dataset, with the remainder held out for validation. Hyperparameters (such as the number of trees, tree depth, and the learning rate for boosting) are optimized to minimize prediction error while avoiding overfitting. The boosted Extra Trees regressor’s strength lies in its ability to learn the spectral “fingerprints” of different water constituents (for example, the chlorophyll-a absorption features around 440 nm and 670 nm, the phycocyanin feature near 620 nm, or the broad scattering-driven shape changes due to TSS) and to relate those quantitatively to concentration values. [9, 10]
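As a rough illustration of the idea described above, the sketch below wraps scikit-learn's `ExtraTreesRegressor` in `AdaBoostRegressor`. It is not Pixxel's implementation: the choice of AdaBoost as the booster, every hyperparameter value, and the synthetic "spectra" and target are assumptions made only so the example runs on its own. Because `AdaBoostRegressor` predicts a single target, a setup like this would need one model per water quality parameter.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for Rrs spectra: 300 samples x 20 bands.
X = rng.uniform(0.001, 0.05, size=(300, 20))
# Hypothetical target: a noisy function of a two-band ratio,
# standing in for a log-transformed constituent concentration.
y = np.log10(X[:, 15] / X[:, 5]) + rng.normal(0.0, 0.05, size=300)

# 80/20 split, mirroring the split used in the performance sections.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Extra Trees as the base learner; values here are illustrative, not tuned.
base = ExtraTreesRegressor(n_estimators=20, max_depth=6, random_state=0)
# Passed positionally because the keyword name differs across scikit-learn versions.
model = AdaBoostRegressor(base, n_estimators=5, learning_rate=0.5, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("held-out R^2:", round(model.score(X_test, y_test), 3))
```

In a real workflow the number of trees, tree depth, and boosting learning rate would be tuned on the training portion only, with the held-out portion reserved for the final assessment.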

Inland/Coastal Focus and Satellite Validation

The overall algorithm is specifically tailored to inland and coastal waters, which are optically complex environments (sometimes called “Case 2” waters in remote sensing). Unlike the open ocean (Case 1) where phytoplankton largely dominate optical properties, inland and near-shore waters often have a mix of algae, sediments, and dissolved organics influencing the color of the water. Pixxel’s solution focuses on these challenging waters by training on datasets (GLORIA and SWIPE) that were explicitly designed to represent coastal and inland optical diversity. This focus ensures that the model is sensitive to signals of local eutrophication, sediment plumes, or runoff – critical for water quality monitoring of lakes, rivers, reservoirs, and estuaries.

Crucially, Pixxel validates its water quality algorithm using satellite and airborne imagery to demonstrate real-world performance. After training the boosted Extra Trees model on the combination of field and synthetic data, the team tests it on actual hyperspectral images of water bodies. For instance, the model can be applied to Pixxel’s own hyperspectral satellite data or proxy data from similar sensors, producing maps of chlorophyll, TSS, CDOM, and phycocyanin across a scene. These retrievals are then compared against known ground-truth measurements or well-studied events for validation. The inclusion of SWIPE’s physics-based spectra in training helps the model generalize across different sensors and observation conditions. In fact, the approach has sensor-agnostic qualities – the same trained model can be used on data from different hyperspectral instruments (satellites or airborne) with consistent results. NASA’s research has shown that applying models trained on synthetic and in situ data to new-generation hyperspectral sensors yields reliable water quality products across sensors, and Pixxel follows this best practice. During validation, the algorithm successfully reproduces spatial patterns and magnitudes of water quality parameters in satellite imagery. For example, it can identify algal bloom hot spots in a lake by elevated chlorophyll-a and phycocyanin estimates, delineate sediment-laden plumes in a river mouth via high TSS values, or map CDOM-rich humic stained waters in wetlands. The correspondence between the model’s outputs and validation data (from field sampling or other established products) gives confidence in its accuracy. By iteratively validating and refining the model, Pixxel ensures the algorithm meets the rigor required by technical clients – providing quantitative, reliable retrievals rather than just qualitative indicators.

Overall, Pixxel’s water quality algorithm exemplifies a rigorous, modern approach to remote sensing analytics. It combines a globally diverse field dataset (GLORIA) with a synthetic spectral library (SWIPE) to capture the full variability of inland and coastal water optics, uses a boosted ensemble regression model (Extra Trees) to robustly learn the reflectance-to-quality relationships, and is finely tuned and verified with real satellite imagery. This methodology allows technical users to trust that Pixxel’s hyperspectral imagery can be converted into actionable water quality information – from tracking nutrient-driven algal blooms (chlorophyll, phycocyanin) to monitoring turbidity and sediment flux (TSS) and dissolved organic matter – with scientific-grade confidence and detail. The result is a powerful tool for environmental monitoring agencies, researchers, and water resource managers to assess and manage water quality at scale using Pixxel’s high-resolution hyperspectral satellite constellation.

Performance assessment of the Chlorophyll-a estimation model

Chlorophyll-a (Chl-a), measured in mg m⁻³, is a crucial indicator of water quality, particularly in assessing the presence and abundance of algae and phytoplankton. Monitoring Chl-a helps in understanding the primary productivity of aquatic ecosystems and the potential for eutrophication.

Reflectance ratios at specific wavelengths are widely used as reliable proxies for estimating Chl-a concentrations, owing to the distinct optical absorption and scattering properties of phytoplankton pigments. In optically clear waters, where interference from suspended particles and colored dissolved organic matter (CDOM) is minimal, the ratio of reflectance in the blue to green spectral regions effectively captures the strong absorption of Chl-a in the blue and minimal absorption in the green, making it a sensitive indicator of Chl-a levels. However, in turbid or highly productive waters, where the optical signal is influenced by higher concentrations of suspended solids and phytoplankton, the blue region is often compromised. In such cases, reflectance ratios involving the red and near-infrared bands are more effective, as Chl-a shows a characteristic absorption in the red (around 665 nm) and enhanced backscattering in the NIR, enabling more accurate quantification of Chl-a under complex water conditions.
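The two kinds of ratio described above can be computed with a small helper, sketched here under stated assumptions. Only the red band near 665 nm comes from the text; the blue (443 nm), green (560 nm), and red-edge/NIR (705 nm) band centres, the function name `band_ratio`, and the example reflectance values are illustrative choices, not the bands used by the model.

```python
def band_ratio(rrs: dict[int, float], numerator_nm: int, denominator_nm: int) -> float:
    """Return Rrs(numerator) / Rrs(denominator) for one spectrum or pixel."""
    denominator = rrs[denominator_nm]
    if denominator <= 0:
        raise ValueError("denominator reflectance must be positive")
    return rrs[numerator_nm] / denominator


# Hypothetical Rrs values (sr^-1) for a turbid, productive water pixel.
spectrum = {443: 0.004, 560: 0.012, 665: 0.006, 705: 0.009}

# Blue-to-green ratio: the proxy suited to optically clear water.
blue_green = band_ratio(spectrum, 443, 560)
# NIR-to-red ratio: the proxy suited to turbid or highly productive water.
nir_red = band_ratio(spectrum, 705, 665)

print(f"blue/green = {blue_green:.3f}, NIR/red = {nir_red:.3f}")
```

Ratios like these are candidate input features; a machine learning model can take them alongside, or instead of, the individual band reflectances.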

The present chlorophyll-a (Chl-a) estimation model, derived using machine learning (ML) techniques, was developed from approximately 14,100 paired observations of in-situ Chl-a concentrations and coincident remote sensing reflectance (Rrs) measurements, spanning five distinct optical water types across global inland and coastal waterbodies (Fig. 1). For model training and evaluation, 80% of the dataset was allocated to training (hereafter referred to as the train dataset), while the remaining 20% was withheld for independent validation (hereafter referred to as the test dataset).

Fig. 1. Median of Normalised Rrs spectra, classified into five different water types with the k-means clustering technique, used for training the Chl-a estimation model.
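A minimal sketch of grouping normalised spectra into optical water types with k-means is given below. The source does not state how the spectra were normalised or how the clustering was initialised, so unit-length (L2) normalisation and deterministic farthest-first seeding are assumptions, and the four 4-band spectra are made up. The models described here use five clusters; the toy example uses two.

```python
import math


def normalise(spectrum):
    """Scale a spectrum to unit L2 norm so clustering sees shape, not magnitude."""
    norm = math.sqrt(sum(v * v for v in spectrum))
    return [v / norm for v in spectrum]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the position of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, iterations=20):
    # Farthest-first seeding: deterministic, and spreads the initial centroids out.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centroids), centroids


# Hypothetical 4-band spectra: two blue-dominated and two green/red-dominated.
clear = [[0.010, 0.008, 0.002, 0.001], [0.020, 0.017, 0.004, 0.002]]
turbid = [[0.004, 0.012, 0.010, 0.008], [0.008, 0.025, 0.019, 0.017]]
points = [normalise(s) for s in clear + turbid]

labels, centroids = kmeans(points, k=2)
print(labels)
```

Normalising first means spectra that differ only in brightness land in the same cluster, which is what makes the clusters interpretable as spectral shapes, that is, optical water types.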

Figure 2 presents scatter plots illustrating the model-derived Chl-a estimates for both the training and test datasets. Table 1 provides a summary of the model evaluation using seven error metrics: Bias, Intercept, Median Absolute Percentage Difference (MAPD), Normalised Root Mean Squared Error (NRMSE), Normalised Mean Absolute Error (NMAE), coefficient of determination (R²), and slope of the regression line.

Fig. 2. Scatter plot of model-derived Chl-a versus true Chl-a for train datasets (80%) [Left] and test datasets (20%) [Right].

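| Metric | Train (80%) | Test (20%) | Desired |
|---|---|---|---|
| Bias | 0.649 | 1.734 | 0.0 |
| Intercept | 0.453 | 0.495 | 0.0 |
| MAPD (%) | 7.021 | 30.285 | ≤ 30.0 |
| NMAE | 0.001 | 0.008 | 0.0 |
| NRMSE | 0.007 | 0.026 | 0.0 |
| R² | 0.979 | 0.868 | ≥ 0.8 |
| Slope | 0.916 | 0.818 | 1.0 |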

Table 1. Performance evaluation of the present Chl-a estimation model against true values.
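For readers who want to compute comparable statistics on their own matchups, the sketch below implements one plausible reading of the seven metrics. The source does not give the formulas, so every definition here is an assumption: bias as the mean of (estimated − true), MAPD as the median of the absolute relative difference in percent, NMAE and NRMSE normalised by the range of the true values, R² as the squared Pearson correlation, and slope and intercept from an ordinary least-squares fit of estimated on true values. Values produced this way are therefore not guaranteed to be comparable with those in the table.

```python
import math
from statistics import mean, median


def evaluate(true, est):
    """Return the seven metrics under the assumed definitions (hypothetical helper)."""
    errors = [e - t for t, e in zip(true, est)]
    value_range = max(true) - min(true)

    # Ordinary least-squares fit: est = slope * true + intercept.
    mean_true, mean_est = mean(true), mean(est)
    sxx = sum((t - mean_true) ** 2 for t in true)
    syy = sum((e - mean_est) ** 2 for e in est)
    sxy = sum((t - mean_true) * (e - mean_est) for t, e in zip(true, est))
    slope = sxy / sxx

    return {
        "bias": mean(errors),
        "intercept": mean_est - slope * mean_true,
        "mapd_percent": 100 * median([abs(d) / t for d, t in zip(errors, true)]),
        "nmae": mean([abs(d) for d in errors]) / value_range,
        "nrmse": math.sqrt(mean([d * d for d in errors])) / value_range,
        "r2": sxy ** 2 / (sxx * syy),
        "slope": slope,
    }


# Hypothetical matchups: true vs. model-derived concentrations.
true_values = [1.0, 2.0, 4.0, 8.0]
estimates = [1.1, 1.8, 4.4, 7.2]
for name, value in evaluate(true_values, estimates).items():
    print(f"{name}: {value:.3f}")
```

A perfect model would give a slope of 1, an intercept and bias of 0, and an R² of 1, which matches the "Desired" column of the table.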

In conclusion, the Chl-a estimation ML model demonstrates satisfactory performance, with an R² of 0.868 on the test dataset, above the desired threshold of 0.8. The test MAPD of 30.285% sits essentially at the desired 30% limit, so the model can be regarded as a reliable tool for Chl-a prediction across the various optical water types.

Performance assessment of the Total Suspended Solids estimation model

Total Suspended Solids (TSS), measured in g m⁻³, is a key indicator of water quality, particularly in assessing the presence and abundance of suspended particles, primarily inorganic salts and sediments. Monitoring TSS helps in understanding the pollution level due to wastewater discharge or sediment transport.

Reflectance ratios at specific wavelengths are strongly correlated with turbidity, quantified as TSS concentration. In optically clear waters, the ratio of reflectance in the blue and green spectral regions is particularly sensitive to low concentrations of suspended particles. This sensitivity arises because suspended solids preferentially scatter and absorb light in the shorter wavelengths, causing measurable changes in the blue-to-green reflectance ratio. As TSS concentration increases, especially in more turbid waters, the spectral response tends to shift toward longer wavelengths, necessitating the use of red or near-infrared bands for more accurate quantification.

The present TSS estimation model, derived using machine learning (ML) techniques, was developed from approximately 14,600 paired observations of in-situ TSS concentrations and coincident remote sensing reflectance (Rrs) measurements, spanning five distinct optical water types across global inland and coastal waterbodies (Fig. 3). For model training and evaluation, 80% of the dataset was allocated to training (hereafter referred to as the train dataset), while the remaining 20% was withheld for independent validation (hereafter referred to as the test dataset).

Fig. 3. Median of Normalised Rrs spectra, classified into five different water types with the k-means clustering technique, used for training the TSS estimation model.

Figure 4 presents scatter plots illustrating the model-derived TSS estimates for both the training and test datasets. Table 2 provides a summary of the model evaluation using seven error metrics: Bias, Intercept, Median Absolute Percentage Difference (MAPD), Normalised Root Mean Squared Error (NRMSE), Normalised Mean Absolute Error (NMAE), coefficient of determination (R²), and slope of the regression line.

Fig. 4. Scatter plot of model-derived TSS versus true TSS for train datasets (80%) [Left] and test datasets (20%) [Right].

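| Metric | Train (80%) | Test (20%) | Desired |
|---|---|---|---|
| Bias | 1.002 | 3.005 | 0.0 |
| Intercept | 0.768 | 0.561 | 0.0 |
| MAPD (%) | 6.763 | 30.045 | ≤ 30.0 |
| NMAE | 0.001 | 0.004 | 0.0 |
| NRMSE | 0.005 | 0.015 | 0.0 |
| R² | 0.964 | 0.875 | ≥ 0.8 |
| Slope | 0.881 | 0.769 | 1.0 |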

Table 2. Performance evaluation of the present TSS estimation model against true values.

In conclusion, the TSS estimation ML model demonstrates satisfactory performance, with an R² of 0.875 on the test dataset, above the desired threshold of 0.8. The test MAPD of 30.045% sits essentially at the desired 30% limit, so the model can be regarded as a reliable tool for TSS prediction across the various optical water types.

Performance assessment of the aCDOM(440) estimation model

The absorption coefficient of Colored Dissolved Organic Matter (aCDOM) at 440 nm, measured in m⁻¹, is a key indicator of water quality, particularly in assessing the presence and abundance of decaying organic substances. It provides insight into the concentration of colored dissolved organic matter, which influences water clarity, ecological health, and biogeochemical cycles.

Spectral reflectance ratios at specific wavelengths correlate strongly with aCDOM(440). CDOM absorbs light mainly in the ultraviolet and blue regions of the spectrum, owing to humic and fulvic substances derived from terrestrial and aquatic sources. In particular, the ratio of reflectance in the blue to green regions is sensitive to variations in aCDOM(440), because CDOM strongly absorbs shorter wavelengths while exerting minimal influence on the green region. As aCDOM(440) increases, absorption in the blue intensifies, lowering reflectance in that region and thus altering the blue-to-green ratio. This spectral behaviour makes blue-green reflectance ratios effective indicators for estimating aCDOM(440) across diverse aquatic environments.

The present aCDOM(440) estimation model, derived using machine learning (ML) techniques, was developed from approximately 12,100 paired observations of in-situ aCDOM(440) values and coincident remote sensing reflectance (Rrs) measurements, spanning five distinct optical water types across global inland and coastal waterbodies (Fig. 5). For model training and evaluation, 80% of the dataset was allocated to training (hereafter referred to as the train dataset), while the remaining 20% was withheld for independent validation (hereafter referred to as the test dataset).

Fig. 5. Median of Normalised Rrs spectra, classified into five different water types with the k-means clustering technique, used for training the aCDOM(440) estimation model.

Figure 6 presents scatter plots illustrating the model-derived aCDOM(440) estimates for both the training and test datasets. Table 3 provides a summary of the model evaluation using seven error metrics: Bias, Intercept, Median Absolute Percentage Difference (MAPD), Normalised Root Mean Squared Error (NRMSE), Normalised Mean Absolute Error (NMAE), coefficient of determination (R²), and slope of the regression line.

Fig. 6. Scatter plot of model-derived aCDOM(440) versus true aCDOM(440) for train datasets (80%) [Left] and test datasets (20%) [Right].

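| Metric | Train (80%) | Test (20%) | Desired |
|---|---|---|---|
| Bias | 0.191 | 0.163 | 0.0 |
| Intercept | 0.092 | 0.019 | 0.0 |
| MAPD (%) | 20.022 | 34.192 | ≤ 30.0 |
| NMAE | 0.003 | 0.007 | 0.0 |
| NRMSE | 0.014 | 0.024 | 0.0 |
| R² | 0.882 | 0.820 | ≥ 0.8 |
| Slope | 0.757 | 0.849 | 1.0 |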

Table 3. Performance evaluation of the present aCDOM(440) estimation model against true values.

In conclusion, the aCDOM(440) estimation ML model demonstrates satisfactory performance, with an R² value greater than 0.8 for test datasets. However, the MAPD calculated for the test dataset is marginally higher than the desired limit of 30% error. This increased error is mainly due to the absence of multiple green bands in the Sentinel-2 sensor, which restricts the model’s ability to accurately detect variations in CDOM absorption, especially in the green spectral region.

References

  1. https://www.nature.com/articles/s41597-023-01973-y
  2. https://www.sciencedirect.com/science/article/pii/S0034425719306248
  3. https://www.sciencedirect.com/science/article/abs/pii/S0034425720303448
  4. https://www.sciencedirect.com/science/article/pii/S0034425721005800
  5. https://ntrs.nasa.gov/citations/20220017490
  6. https://www.nasa.gov/nasa-earth-exchange-nex/new-missions-support/aquatic-science-in-the-hyperspectral-era
  7. https://www.frontiersin.org/journals/environmental-science/articles/10.3389/fenvs.2021.587660/full
  8. https://www.nature.com/articles/s41597-023-02310-z
  9. https://www.sciencedirect.com/science/article/pii/S2590123024017778