Smajilbasic, Fadil and Grob, Nils-Robin and Hersche, Matthias (2025) Cloud-Optimized OSM GeoParquet Data Service for Switzerland and Beyond. Other thesis, OST Ostschweizer Fachhochschule.
FS 2025-BA-EP-Smajilbasic-Grob-Hersche-Cloud-Optimized OSM GeoParquet Data Service for Switzerland.pdf - Supplemental Material
Abstract
OpenStreetMap (OSM) is one of the most comprehensive openly licensed geospatial vector datasets, containing an estimated 60–90 million points of interest (POIs). While this is comparable to the roughly 61 million POIs in Overture Maps, OSM distinguishes itself through its data richness, openness, and crowdsourced quality assurance. However, its raw structure, a graph of nodes, ways, and relations combined with a flexible tagging system, presents significant challenges for scalable querying and analysis.
This thesis presents a reproducible, open-source pipeline designed to transform country-scale OSM extracts, such as those from Geofabrik, into simplified, analysis-ready GeoParquet files. The files are aligned with Overture Maps Places and Divisions themes and converted into a tabular format optimized for geographic information systems. The solution is built on a modular Extract–Transform–Load (ETL) architecture using osm2pgsql with Lua scripts, PostgreSQL/PostGIS for schema alignment and spatial processing, and DuckDB with PyArrow for high-performance GeoParquet conversion.
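To make the idea of flattening tagged OSM elements into a tabular record concrete, here is a minimal sketch in Python. It is an illustration only: the pipeline described above performs this step with osm2pgsql Lua scripts and PostGIS, and the tag priority list, column names, and function name `osm_node_to_row` are assumptions, not the thesis's schema.

```python
from typing import Dict, Optional

# Hypothetical priority list of OSM tag keys that may define a POI category.
CATEGORY_KEYS = ("amenity", "shop", "tourism", "leisure")


def osm_node_to_row(node_id: int, lon: float, lat: float,
                    tags: Dict[str, str]) -> Optional[dict]:
    """Flatten one tagged OSM node into a single table row.

    Returns None when the node carries none of the category tags,
    i.e. when it is not treated as a POI in this sketch.
    """
    for key in CATEGORY_KEYS:
        if key in tags:
            return {
                "id": f"n{node_id}",            # 'n' prefix marks a node
                "name": tags.get("name"),       # may be None if untagged
                "category": tags[key],          # e.g. 'restaurant'
                "geometry_wkt": f"POINT ({lon} {lat})",
            }
    return None


row = osm_node_to_row(1, 8.54, 47.37,
                      {"amenity": "restaurant", "name": "Example"})
print(row)
```

Rows of this shape can be written to a columnar file directly, which is what makes the flattened form convenient for GIS tools compared with the raw node/way/relation graph.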
Multiple spatial file partitioning strategies, including KDB Tree and S2, were evaluated to support efficient downstream interoperability and client-side filtering. The pipeline operates as a CI/CD-enabled DataOps workflow, orchestrated via GitLab, containerized with Docker, and hosted on S3-compatible MinIO storage. A prototype vandalism detection module supports data quality by flagging anomalies in stable administrative names.
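The spatial partitioning idea mentioned above can be illustrated with a simplified KD-style median split, in the spirit of KDB-tree partitioning. This is a sketch under stated assumptions, not the strategy as implemented or evaluated in the thesis: the function name `kd_partition`, the `max_per_cell` parameter, and the approximate city coordinates are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (lon, lat)


def kd_partition(points: List[Point], max_per_cell: int,
                 depth: int = 0) -> List[List[Point]]:
    """Recursively split points at the median, alternating lon and lat,
    until every cell holds at most `max_per_cell` points."""
    if max_per_cell < 1:
        raise ValueError("max_per_cell must be at least 1")
    if len(points) <= max_per_cell:
        return [points]
    axis = depth % 2                       # 0 = split on lon, 1 = split on lat
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2                # median index, both halves non-empty
    return (kd_partition(ordered[:mid], max_per_cell, depth + 1)
            + kd_partition(ordered[mid:], max_per_cell, depth + 1))


# Approximate coordinates of eight Swiss cities, for illustration only.
cities = [(8.54, 47.37), (7.45, 46.95), (6.14, 46.20), (7.59, 47.56),
          (8.95, 46.00), (9.38, 47.42), (6.63, 46.52), (9.53, 46.85)]
cells = kd_partition(cities, max_per_cell=2)
print(len(cells), [len(c) for c in cells])
```

Because each split is at the median, cells end up with similar point counts, so dense urban areas are divided more finely than sparse ones. Each resulting cell could then be written as one file.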
The result is Cadence Maps, a fully automated and publicly accessible data service for the D-A-CH-LI region, updated weekly and accompanied by release documentation. GeoParquet files can be queried directly via DuckDB or QGIS without requiring full downloads. For example, queries can filter specific features such as restaurants using Hive-compatible S3 prefixes. Full dataset updates can be completed in under 24 hours, demonstrating the system's performance and scalability. This work establishes a reliable and extensible framework for delivering cloud-native geospatial data services with global potential.
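As a rough illustration of prefix-based filtering, the sketch below builds a Hive-style `key=value` path and a DuckDB query string that reads only the files under it. The bucket name, the partition keys `country` and `category`, and both helper functions are assumptions for illustration; the actual layout of the published files may differ. DuckDB's `read_parquet` function and its `hive_partitioning` option do exist, but running the query also requires DuckDB's S3 access to be configured, which is not shown here.

```python
def hive_prefix(base: str, **partitions: str) -> str:
    """Join partition key/value pairs into a Hive-style path,
    e.g. base/country=CH/category=restaurant."""
    parts = "/".join(f"{key}={value}" for key, value in partitions.items())
    return f"{base.rstrip('/')}/{parts}"


def build_query(base: str, country: str, category: str) -> str:
    """Build a DuckDB SQL string that scans only one partition prefix.

    Values are interpolated directly for brevity; real code should
    validate them before building SQL.
    """
    prefix = hive_prefix(base, country=country, category=category)
    return (
        "SELECT * "
        f"FROM read_parquet('{prefix}/*.parquet', hive_partitioning = true)"
    )


# 'example-bucket' is a placeholder, not the service's real location.
print(build_query("s3://example-bucket/places/", "CH", "restaurant"))
```

Since the filter is encoded in the object path, a client only lists and reads the files for the requested category instead of downloading the full dataset.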
| Item Type: | Thesis (Other) |
|---|---|
| Subjects: | Area of Application > GIS; Area of Application > GIS > OpenStreetMap; Technologies > Programming Languages > Python; Technologies > Databases; Technologies > Parallel Computing; Metatags > IFS (Institute for Software) |
| Divisions: | Bachelor of Science FHO in Informatik > Bachelor Thesis |
| Depositing User: | OST Deposit User |
| Date Deposited: | 29 Sep 2025 10:49 |
| Last Modified: | 29 Sep 2025 10:49 |
| URI: | https://eprints.ost.ch/id/eprint/1307 |
