Download Parquet Data
Download ski resort datasets in Parquet format and learn how we create them.
How We Create the Data
Built from OpenStreetMap using a multi-step pipeline
Our datasets are built from OpenStreetMap (OSM) using a multi-step pipeline. We use regional PBF extracts from Geofabrik—continental or country-level OSM data—then process each region through an 11-step pipeline.
The 11-Step Pipeline
From OSM extract to GeoParquet output
1. winter_sports – Extract ski areas and winter-sport facilities from OSM
2. osm_nearby – Extract OSM features within ~2 km of each ski area
3. lifts and pistes – Extract lift lines and piste (trail) geometries
4. enrich – Add boundaries and administrative data, and enrich attributes
5. analyze – Compute statistics (trail counts, elevation, area, etc.)
6. parquet – Export to GeoParquet for compact storage and fast reads
7. buffer – Build a 1,000 ft buffer polygon around each ski area for mapping
8. translate – Add or fill English names for resort display
9. elevation / contours – Attach elevation and contour data per ski area
10. re-export CSV – Regenerate the analyzed CSV with elevation and final fields
11. combine_regions – Merge all regional outputs into one global dataset
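The steps above can be sketched as an ordered list of stages applied to each region. The stage names follow the list, but the function bodies and the `run_pipeline` helper here are illustrative, not code from the globalskiatlas_data repo.

```python
def make_stage(name):
    # A real stage would transform the region's working data on disk;
    # this toy version just records that the stage ran.
    def stage(state):
        state["log"].append(name)
        return state
    return stage

# The 11 stages, in the order the docs describe.
PIPELINE = [make_stage(n) for n in [
    "winter_sports", "osm_nearby", "lifts_and_pistes", "enrich",
    "analyze", "parquet", "buffer_1000ft", "translate",
    "elevation_contours", "reexport_csv", "combine_regions",
]]

def run_pipeline(region):
    state = {"region": region, "log": []}
    for stage in PIPELINE:
        state = stage(state)
    return state

result = run_pipeline("austria")
print(result["log"])  # all 11 stage names, in order
```

Keeping the stages as a flat ordered list makes it easy to resume a failed run from a given step, which matters when a single region takes hours.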
Regions & Deployment
Scale by region, merge globally
Regions are defined in config/regions.yaml. Large areas (Europe, North America, Asia) are split into countries, states, or sub-regions so each run stays manageable. After processing, we combine regional outputs into a single global dataset using our combine_regions script.
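The split-then-merge layout can be illustrated with a small sketch. The region names and the shape of the parsed config below are hypothetical stand-ins for whatever config/regions.yaml actually contains.

```python
# Hypothetical shape of config/regions.yaml after parsing; the real
# file's keys and region names may differ.
REGIONS = {
    "europe": ["france", "austria", "switzerland"],
    "north-america": ["us-west", "us-northeast", "canada"],
    "japan": [],  # small enough to run as a single extract
}

def processing_units(regions):
    """Yield one (region, sub_region) job per PBF extract to process.

    Large regions are split into sub-regions so each run stays within
    memory limits; small regions run as a single job.
    """
    for region, subs in regions.items():
        if subs:
            for sub in subs:
                yield (region, sub)
        else:
            yield (region, None)

jobs = list(processing_units(REGIONS))
print(len(jobs))  # 7 jobs: 3 + 3 sub-regions, plus japan as one unit
```

After all jobs finish, combine_regions reads each job's output and merges them into the single global dataset.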
The pipeline runs either locally with Docker or on AWS ECS Fargate for continent-wide batch jobs. Full Europe or North America runs take roughly 5–8 hours each.
View globalskiatlas_data on GitHub
Datasets
GeoParquet format — use with Pandas, DuckDB, GeoPandas
Each file has embedded geometry. Download below:
Why Iceberg & AWS Glue?
So lots of people and apps can use the same data without stepping on each other’s toes.
We have millions of rows about ski areas, lifts, and trails. If we only kept them in one big file, only one person could update them at a time, and it’d be easy to overwrite someone else’s work.
Apache Iceberg is like a tidy filing system in the cloud: it keeps the data in chunks, tracks changes over time, and lets many tools read or write without breaking anything. AWS Glue is the “card catalog” that tells everyone where to find those files—so data scientists, apps, and this website can all use the same tables without getting lost.
Together they give us one shared source of truth for ski data that stays consistent and is easy to query. The numbers below are live from that system.
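The "tidy filing system" idea can be shown with a toy model: data files are immutable, and each commit writes a new snapshot listing the current files, so a reader pinned to one snapshot never sees a concurrent writer's changes. This is an illustration of the concept only, not the real Iceberg metadata layout.

```python
import json

class ToyTable:
    """Toy Iceberg-style table: immutable data files + snapshot list."""

    def __init__(self):
        self.snapshots = []  # each snapshot: JSON listing the data files

    def commit(self, data_files):
        # A commit never rewrites old snapshots; it appends a new one.
        self.snapshots.append(json.dumps({"files": list(data_files)}))

    def read(self, snapshot_id=-1):
        # Readers resolve a snapshot once and get a stable file list.
        return json.loads(self.snapshots[snapshot_id])["files"]

table = ToyTable()
table.commit(["lifts-0001.parquet"])
reader_view = table.read()  # reader pins the first snapshot
table.commit(["lifts-0001.parquet", "lifts-0002.parquet"])  # writer adds a file

print(reader_view)    # still ['lifts-0001.parquet']
print(table.read())   # ['lifts-0001.parquet', 'lifts-0002.parquet']
```

AWS Glue plays the role of the `snapshots` list here at catalog scale: it tells every tool which table metadata is current.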
How we query it: query_iceberg.py and /api/iceberg-stats (Lambda)
Further Reading
Pipeline docs in the globalskiatlas_data repo
- LOCAL_WORKFLOW.md – Run the pipeline locally with Docker
- RUN_BY_REGION.md – Region layout, PBF sizes, OOM avoidance
- WORLD_SCALE.md – Roadmap for world-scale data and serving
- AWS_ECS_DEPLOYMENT.md – Deploy to AWS ECS Fargate and S3