# GeoPre: Geospatial Data Processing Toolkit
**GeoPre** is a Python library designed to streamline common geospatial data operations, offering a unified interface for handling raster and vector datasets. It simplifies preprocessing tasks essential for GIS analysis, machine learning workflows, and remote sensing applications.
### Key Features
- **Data Scaling**:
  - Normalization (Z-Score) and Min-Max scaling for raster bands.
  - Prepares data for ML models while preserving geospatial metadata.
- **CRS Management**:
  - Retrieve and compare Coordinate Reference Systems (CRS) across raster (Rasterio/Xarray) and vector (GeoPandas) datasets.
  - Ensure consistency between datasets with automated CRS checks.
- **Reprojection**:
  - Reproject vector data (GeoDataFrames) and raster data (Rasterio/Xarray) to any target CRS.
  - Supports EPSG codes, WKT, and Proj4 strings.
- **No-Data Masking**:
  - Handle missing values in raster datasets (NumPy/Xarray) with flexible masking.
  - Integrates with raster metadata, so the no-data value can be read from the dataset profile.
- **Cloud Masking**:
  - Identify and mask clouds in Sentinel-2 and Landsat imagery.
  - Supports multiple methods: QA bands, scene classification layers (SCL), probability bands, and OmniCloudMask AI-based detection.
  - Optionally mask cloud shadows for improved accuracy.
- **Band Stacking**:
  - Stack multiple raster bands from a folder into a single multi-band raster for analysis.
  - Supports automatic band detection and resampling for different resolutions.
### Supported Data Types
- **Raster**: NumPy arrays, Rasterio `DatasetReader`, Xarray `DataArray` (via rioxarray).
- **Vector**: GeoPandas `GeoDataFrame`.
### Benefits of GeoPre
- **Unified Workflow**: Eliminates boilerplate code by providing consistent functions for raster and vector data.
- **Interoperability**: Bridges gaps between GeoPandas, Rasterio, and Xarray, ensuring smooth data transitions.
- **Robust Error Handling**: Automatically detects CRS mismatches and missing metadata to prevent silent failures.
- **Efficiency**: Optimized reprojection and masking operations reduce preprocessing time for large datasets.
- **ML-Ready Outputs**: Scaling functions preserve data structure, making outputs directly usable in machine learning pipelines.
Ideal for researchers and developers working with geospatial data, **GeoPre** enhances productivity by standardizing preprocessing steps and ensuring compatibility across diverse geospatial tools.
## Installation
Ensure you have the required dependencies installed before using this library:
```bash
pip install numpy geopandas rasterio rioxarray xarray pyproj
```
## Usage
### 1. Data Scaling
#### `Z_score_scaling`
**Description**: Centers the data around zero by subtracting the mean and dividing by the standard deviation. This is useful for machine learning models that are sensitive to the scale of their inputs, and it can standardize a band of pixel values before clustering or classification.
**Parameters**:
- data (numpy.ndarray): Input array to normalize.
**Returns**:
- numpy.ndarray: Standardized data with mean 0 and standard deviation 1.
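As a rough illustration of the computation described above, the following sketch standardizes an array with plain NumPy. The function name `zscore_sketch` and its handling of constant bands and NaN values are assumptions made for this example; they are not taken from the GeoPre source.

```python
import numpy as np

def zscore_sketch(band: np.ndarray) -> np.ndarray:
    """Standardize an array to mean 0 and standard deviation 1 (hypothetical helper)."""
    band = band.astype("float64")
    mean = np.nanmean(band)  # ignore NaN cells when computing statistics
    std = np.nanstd(band)
    if std == 0:
        # A constant band has no spread; only center it to avoid dividing by zero.
        return band - mean
    return (band - mean) / std

data = np.array([[10, 20, 30], [40, 50, 60]])
scaled = zscore_sketch(data)
```

The output keeps the input shape, so it can be written back with the original raster metadata.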
#### `Min_Max_Scaling`
**Description**: Scales pixel values to a fixed range, typically [0, 1] or [-1, 1]. This is ideal when you want to preserve the relative spacing between values. For example, GeoTIFF pixel values in the range 0 to 65535 can be scaled to [0, 1].
**Parameters**:
- data (numpy.ndarray): Input array to normalize.
**Returns**:
- numpy.ndarray: Scaled data with values between 0 and 1, or -1 and 1.
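The min-max formula can be sketched in a few lines of NumPy. Note that `minmax_sketch` and its `feature_range` argument are hypothetical names chosen for this illustration; how `Min_Max_Scaling` selects between [0, 1] and [-1, 1] is not documented here.

```python
import numpy as np

def minmax_sketch(band: np.ndarray, feature_range=(0.0, 1.0)) -> np.ndarray:
    """Linearly rescale an array to feature_range (hypothetical helper)."""
    band = band.astype("float64")
    lo, hi = np.nanmin(band), np.nanmax(band)
    out_min, out_max = feature_range
    if hi == lo:
        # A constant band cannot be stretched; map everything to the lower bound.
        return np.full_like(band, out_min)
    unit = (band - lo) / (hi - lo)  # values now lie in [0, 1]
    return unit * (out_max - out_min) + out_min

# 16-bit pixel values such as those found in many GeoTIFF files
data = np.array([[0, 16384], [32768, 65535]], dtype="uint16")
scaled01 = minmax_sketch(data)
scaled_pm1 = minmax_sketch(data, feature_range=(-1.0, 1.0))
```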
#### Example:
```python
import numpy as np
from scaling_and_reproject import Z_score_scaling, Min_Max_Scaling

data = np.array([[10, 20, 30], [40, 50, 60]])
z_scaled = Z_score_scaling(data)
minmax_scaled = Min_Max_Scaling(data)
```
### 2. CRS Management
#### `get_crs`
**Description**: Retrieve CRS from geospatial data objects.
**Parameters**:
- data: GeoPandas GeoDataFrame (vector), Rasterio DatasetReader (raster), or Xarray DataArray with the `rio` accessor (raster).
**Returns**:
- pyproj.CRS: Coordinate reference system, or None if undefined.
#### `compare_crs`
**Description**: Compare CRS between raster and vector datasets.
**Parameters**:
- raster_obj (DatasetReader/xarray.DataArray): Raster data source.
- vector_gdf (gpd.GeoDataFrame): Vector data source.
**Returns**:
- dict: Comparison results with keys:
  - raster_crs: Formatted CRS string
  - vector_crs: Formatted CRS string
  - same_crs: Boolean comparison result
  - error: Exception message if any
#### Example:
```python
import geopandas as gpd
import rasterio
from scaling_and_reproject import get_crs, compare_crs

vector = gpd.read_file("data.shp")
raster = rasterio.open("image.tif")
print(get_crs(vector))  # e.g. EPSG:4326
print(compare_crs(raster, vector))  # CRS comparison results
```
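One way to act on the comparison result is to turn it into a hard check before further processing. The sketch below is a hypothetical helper, `require_same_crs`, that consumes a dictionary with the four documented keys. The sample dictionaries are hand-written stand-ins, and the assumption that `error` is `None` when nothing went wrong is not confirmed by the library documentation.

```python
def require_same_crs(result: dict) -> None:
    """Raise if a compare_crs-style result reports an error or a CRS mismatch."""
    if result.get("error"):
        raise RuntimeError(f"CRS comparison failed: {result['error']}")
    if not result.get("same_crs"):
        raise ValueError(
            f"CRS mismatch: raster={result.get('raster_crs')}, "
            f"vector={result.get('vector_crs')}"
        )

# Hand-written dictionaries standing in for compare_crs output (illustrative values)
matching = {"raster_crs": "EPSG:4326", "vector_crs": "EPSG:4326",
            "same_crs": True, "error": None}
mismatched = {"raster_crs": "EPSG:32633", "vector_crs": "EPSG:4326",
              "same_crs": False, "error": None}

require_same_crs(matching)  # passes silently
try:
    require_same_crs(mismatched)
    mismatch_message = None
except ValueError as exc:
    mismatch_message = str(exc)
```

Failing early like this keeps a mismatch from surfacing later as misaligned clipping or zonal statistics.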
### 3. Reprojection
#### `reproject_data`
**Description**: Reproject geospatial data to target CRS.
**Parameters**:
- data: GeoDataFrame (vector reprojection), Rasterio dataset (returns array + metadata), or Xarray object (rioxarray reprojection).
- target_crs: CRS to reproject to (EPSG code, WKT, or Proj4 string).
**Returns**:
- Reprojected data in a format matching the input type.
#### Example:
```python
import geopandas as gpd
import rasterio
import rioxarray
from scaling_and_reproject import reproject_data

# Vector reprojection
vector = gpd.read_file("data.shp")
reprojected_vector = reproject_data(vector, "EPSG:3857")

# Raster reprojection (Rasterio)
with rasterio.open("input.tif") as src:
    array, metadata = reproject_data(src, "EPSG:32633")

# Xarray reprojection (xr.open_rasterio was removed from xarray; use rioxarray)
da = rioxarray.open_rasterio("image.tif")
reprojected_da = reproject_data(da, "EPSG:4326")
```
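To make concrete what a reprojection from EPSG:4326 to EPSG:3857 does to a single coordinate, the sketch below applies the spherical Web Mercator forward formula in plain Python. It is only an illustration of the underlying math: `reproject_data` presumably delegates to GeoPandas, Rasterio, or rioxarray rather than computing this by hand, and `lonlat_to_web_mercator` is a hypothetical name.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used as the sphere radius by EPSG:3857

def lonlat_to_web_mercator(lon_deg: float, lat_deg: float) -> tuple:
    """Convert longitude/latitude in degrees (EPSG:4326) to metres (EPSG:3857)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# A point in Rome, and the eastern edge of the projection on the equator
x, y = lonlat_to_web_mercator(12.4924, 41.8902)
x_edge, y_equator = lonlat_to_web_mercator(180.0, 0.0)
```

Degrees become metres, which is why distances and buffers behave differently after reprojection.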
### 4. No-Data Masking
#### `mask_raster_data`
**Description**: Mask no-data values in raster datasets. Handles both rasterio (numpy) and rioxarray (xarray) workflows.
**Parameters**:
- data: Raster data (numpy.ndarray or xarray.DataArray).
- profile: Rasterio metadata dict (required for numpy arrays).
- no_data_value: Override for the nodata value stored in the metadata.
- return_mask: Whether to return the boolean mask.
**Returns**:
- Masked data array. For numpy inputs, returns a tuple `(masked_array, profile)`. For xarray inputs, returns a DataArray.
#### Example:
```python
import rasterio
import rioxarray
from scaling_and_reproject import mask_raster_data

# Rasterio workflow
with rasterio.open("data.tif") as src:
    data = src.read(1)
    masked, profile = mask_raster_data(data, src.profile)

# rioxarray workflow (xr.open_rasterio was removed from xarray; use rioxarray)
da = rioxarray.open_rasterio("data.tif")
masked_da = mask_raster_data(da)
```
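The core idea of no-data masking can be sketched with NumPy alone. The helper below, `mask_nodata_sketch`, is hypothetical and assumes masked cells are replaced with NaN; whether `mask_raster_data` uses NaN, a NumPy masked array, or another representation is not specified above.

```python
import numpy as np

def mask_nodata_sketch(band: np.ndarray, nodata: float):
    """Return a float copy with nodata cells set to NaN, plus the boolean mask."""
    mask = band == nodata
    masked = band.astype("float64")  # astype returns a copy, so the input is untouched
    masked[mask] = np.nan
    return masked, mask

band = np.array([[1, -9999, 3], [4, 5, -9999]])
masked, mask = mask_nodata_sketch(band, nodata=-9999)
valid_mean = np.nanmean(masked)  # statistics now ignore the no-data cells
```

Without the mask, the sentinel value -9999 would dominate any mean or standard deviation computed on the band.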
### 5. Cloud Masking
#### `mask_clouds_S2`
**Description**: Masks clouds and, optionally, cloud shadows in a Sentinel-2 raster image using one of several methods.
**Parameters**:
- `image_path` *(str)*: Path to the input raster image.
- `output_path` *(str, optional)*: Path to save the masked output raster. Defaults to the same directory as the input, with `_masked` appended to the filename.
- `method` *(str, optional)*: The method for masking. Options are:
  - `'auto'`: Automatically chooses the best available method.
  - `'qa'`: Uses the QA60 band to mask clouds. Warning: QA60 is deprecated for images acquired after 2022-01-25, so results for later images may be wrong.
  - `'probability'`: Uses the cloud probability band MSK_CLDPRB with a threshold for masking.
  - `'omnicloudmask'`: Uses OmniCloudMask for AI-based cloud detection. This can take a long time on large images.
  - `'scl'`: Uses the Scene Classification Layer (SCL) for masking.
  - `'standard'`: Similar to `'auto'`, but avoids the OmniCloudMask method.
- `mask_shadows` *(bool)*: Whether to mask cloud shadows. Defaults to `False`.
- `threshold` *(int, optional)*: Cloud probability threshold (if using a cloud probability band), from 0 to 100. Defaults to `20`.
- `qa60_idx` *(int, optional)*: Index of the QA60 band (1-based). Auto-detected if not provided.
- `qa60_path` *(str, optional)*: Path to the QA60 band (if in a separate file).
- `prob_band_idx` *(int, optional)*: Index of the cloud probability band (1-based). Auto-detected if not provided.
- `prob_band_path` *(str, optional)*: Path to the cloud probability band (if in a separate file).
- `scl_idx` *(int, optional)*: Index of the SCL band (1-based). Auto-detected if not provided.
- `scl_path` *(str, optional)*: Path to the SCL band (if in a separate file).
- `red_idx`, `green_idx`, `nir_idx` *(int, optional)*: Indices of the red, green, and NIR bands, respectively. Auto-detected if not provided.
- `nodata_value` *(float)*: Value for no-data regions. Defaults to `np.nan`.
**Returns**:
- *(str)*: The path to the saved masked output raster.
#### Example:
```python
from cloud_masking import mask_clouds_S2

output_s2 = mask_clouds_S2("sentinel2_image.tif", method='auto', mask_shadows=True)
```
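For intuition, the `'scl'` and `'probability'` methods can be sketched on small arrays. In the Sentinel-2 Scene Classification Layer, classes 8, 9, and 10 are cloud medium probability, cloud high probability, and thin cirrus, and class 3 is cloud shadow. The helper names below are hypothetical, and the strict `>` comparison against the threshold is an assumption about how `mask_clouds_S2` applies it.

```python
import numpy as np

SCL_CLOUD_CLASSES = (8, 9, 10)  # cloud medium probability, cloud high probability, thin cirrus
SCL_SHADOW_CLASS = 3            # cloud shadows

def scl_cloud_mask(scl: np.ndarray, mask_shadows: bool = False) -> np.ndarray:
    """True where the SCL band marks a cloud (and, optionally, a cloud shadow)."""
    classes = SCL_CLOUD_CLASSES + ((SCL_SHADOW_CLASS,) if mask_shadows else ())
    return np.isin(scl, classes)

def probability_cloud_mask(prob: np.ndarray, threshold: int = 20) -> np.ndarray:
    """True where the cloud probability (0 to 100) exceeds the threshold."""
    return prob > threshold

scl = np.array([[4, 8], [3, 10]])
prob = np.array([[5, 80], [10, 35]])
reflectance = np.array([[0.1, 0.9], [0.05, 0.6]])

cloudy = scl_cloud_mask(scl, mask_shadows=True) | probability_cloud_mask(prob)
masked = np.where(cloudy, np.nan, reflectance)  # NaN plays the role of nodata_value
```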
#### `mask_clouds_landsat`
**Description**: Masks clouds and, optionally, cloud shadows in a Landsat raster image using one of several methods.
**Parameters**:
- **`image_path`** *(str)*: Path to the input multi-band raster image.
- **`output_path`** *(str, optional)*: Path to save the masked output raster. Defaults to the same directory as the input, with `_masked` appended to the filename.
- **`method`** *(str)*: The method for masking. Options are:
  - **`'auto'`**: Automatically chooses the best available method.
  - **`'qa'`**: Uses the QA_PIXEL band to mask clouds.
  - **`'omnicloudmask'`**: Uses OmniCloudMask for AI-based cloud detection.
- **`mask_shadows`** *(bool)*: Whether to mask cloud shadows. Defaults to `False`.
- **`qa_pixel_path`** *(str, optional)*: Path to the separate QA_PIXEL raster file.
- **`qa_pixel_idx`** *(int, optional)*: Index of the QA_PIXEL band (1-based).
- **`confidence_threshold`** *(str, optional)*: Confidence threshold for cloud masking (e.g., `'Low'`, `'Medium'`, `'High'`). Defaults to `'High'`. Warning: according to the official Landsat documentation, the confidence bands are still under development, so always use the default `'High'` until further notice. [Source](https://d9-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/s3fs-public/media/files/LSDS-1619_Landsat8-9-Collection2-Level2-Science-Product-Guide-v6.pdf)
- **`red_idx`**, **`green_idx`**, **`nir_idx`** *(int, optional)*: Indices of the red, green, and NIR bands, respectively. Auto-detected if not provided.
- **`nodata_value`** *(float)*: Value for no-data regions. Defaults to `np.nan`.
**Returns**:
- *(str)*: The path to the saved masked output raster.
#### Example:
```python
from cloud_masking import mask_clouds_landsat

output_landsat = mask_clouds_landsat("landsat_image.tif", method='auto', mask_shadows=True)
```
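The `'qa'` method relies on bit flags packed into the QA_PIXEL band. In Landsat Collection 2, bit 3 flags cloud, bit 4 flags cloud shadow, and bits 8 to 9 hold the cloud confidence (1 = low, 2 = medium, 3 = high). The sketch below decodes those bits with NumPy. The helper name and the rule of combining the cloud flag with the confidence level are assumptions for illustration, not the confirmed logic of `mask_clouds_landsat`.

```python
import numpy as np

CLOUD_BIT = 3          # QA_PIXEL bit 3: cloud
SHADOW_BIT = 4         # QA_PIXEL bit 4: cloud shadow
CLOUD_CONF_SHIFT = 8   # QA_PIXEL bits 8-9: cloud confidence
CONFIDENCE_LEVELS = {"Low": 1, "Medium": 2, "High": 3}

def qa_pixel_cloud_mask(qa, mask_shadows=False, confidence_threshold="High"):
    """True where QA_PIXEL flags a cloud at or above the requested confidence."""
    qa = qa.astype("uint16")
    cloud_flag = ((qa >> CLOUD_BIT) & 1).astype(bool)
    confidence = (qa >> CLOUD_CONF_SHIFT) & 0b11
    mask = cloud_flag & (confidence >= CONFIDENCE_LEVELS[confidence_threshold])
    if mask_shadows:
        mask |= ((qa >> SHADOW_BIT) & 1).astype(bool)
    return mask

# Bit patterns for a clear pixel, a high-confidence cloud, and a cloud shadow
qa = np.array([21824, 22280, 23888])
clouds_only = qa_pixel_cloud_mask(qa)
clouds_and_shadows = qa_pixel_cloud_mask(qa, mask_shadows=True)
```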
### 6. Band Stacking
#### `stack_bands`
**Description**: Stacks multiple raster bands from a folder into a single multi-band raster. `.SAFE` folders are also supported.
**Parameters**:
- **`input_path`** *(str or Path)*: Path to the folder containing band files.
- **`required_bands`** *(list of str)*: List of band name identifiers (e.g., `["B4", "B3", "B2"]`).
- **`output_path`** *(str or Path, optional)*: Path to save the stacked raster. Defaults to `"stacked.tif"` in the input folder.
- **`resolution`** *(float, optional)*: Target resolution for resampling. Defaults to the highest available resolution.
**Returns**:
- *(str)*: The path to the saved stacked output raster.
#### Example:
```python
from stacking import stack_bands

stacked_image = stack_bands("/path/to/folder/containing/bands", ["B4", "B3", "B2"])
```
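Stacking bands of different resolutions requires bringing them onto a common grid first. The sketch below shows the idea on in-memory arrays, using nearest-neighbour upsampling by an integer factor. It is an assumption-laden illustration: `upsample_nearest` is a hypothetical helper, and `stack_bands` itself works on files and may use a different resampling method.

```python
import numpy as np

def upsample_nearest(band: np.ndarray, factor: int) -> np.ndarray:
    """Repeat each pixel factor x factor times (nearest-neighbour upsampling)."""
    return np.repeat(np.repeat(band, factor, axis=0), factor, axis=1)

# A 4x4 band at 10 m resolution and a 2x2 band at 20 m covering the same area
band_10m = np.arange(16).reshape(4, 4)
band_20m = np.array([[100, 200], [300, 400]])

# Resample the coarser band to the finer grid, then stack along a new band axis
stacked = np.stack([band_10m, upsample_nearest(band_20m, 2)])
```

The result has shape `(bands, rows, cols)`, the same axis order that Rasterio uses when reading a multi-band raster.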
## Contributing
1. **Fork the repository**
   Click the "Fork" button at the top-right of this repository to create your copy.
2. **Create your feature branch**
   ```bash
   git checkout -b feature/your-feature
   ```
3. **Commit changes**
   ```bash
   git commit -am 'Add some feature'
   ```
4. **Push to branch**
   ```bash
   git push origin feature/your-feature
   ```
5. **Open a Pull Request**
   Navigate to the Pull Requests tab in the original repository and click "New Pull Request" to submit your changes.
## License
This project is licensed under the MIT License. See LICENSE for more information.
## Authors
- Liang Zhongyou – [GitHub Profile](https://github.com/zyl009)
- Matteo Gobbi Frattini – [GitHub Profile](https://github.com/MatteoGobbiF)
