Why Vendoring PROJ Causes Wheel Bloat: Architecture, CI/CD Impact, and Mitigation

Understanding why vendoring PROJ causes wheel bloat requires dissecting how modern Python geospatial wheels package coordinate transformation engines under manylinux and musllinux constraints. PROJ is not a lightweight math library; it is a full geodetic database runtime. When maintainers vendor PROJ into wheels (e.g., pyproj, rasterio, fiona), the build pipeline bundles the core C library (statically linked into the extension or copied in as a shared object), embeds the proj.db SQLite CRS database, and pulls in transitive dependencies such as libsqlite3, libtiff, libcurl, and libssl. This architecture removes any runtime dependency on system libraries, but it inflates wheel sizes from ~5MB to 80–200MB per platform tag, directly increasing CI artifact storage, container image layer size, and serverless cold-start latency.

The Portability Tax: Static Linking & Database Payload

The primary driver of wheel inflation is the proj.db file (~45–55MB uncompressed). Under the Geospatial C-Extension Fundamentals & ABI Architecture model, wheels must ship self-contained binaries to avoid ImportError: libproj.so.XX: cannot open shared object file on target runners. To achieve this, the build links PROJ and GDAL, and then auditwheel (Linux) or delocate (macOS) copies the shared .so/.dylib files into the wheel. The package's own build ships proj.db as package data, so imports and CRS lookups resolve entirely against bundled, version-pinned binaries and data.

Additionally, PROJ’s CMake build system pulls in the following components by default:

  • PROJ_DATA directory bundling (proj.db, EPSG authority tables, and any installed transformation grids)
  • libcurl for remote grid fetching (CMake option ENABLE_CURL=ON; toggled at runtime with PROJ_NETWORK)
  • libtiff for reading GeoTIFF-encoded transformation grids (CMake option ENABLE_TIFF=ON)
  • libsqlite3 (vendored or system) for proj.db I/O; unlike the others, this dependency is required

When combined with auditwheel repair, the tool recursively patches RPATH and copies every dynamically linked .so into the wheel’s .libs/ directory. If the build environment lacks strict dependency pruning, the final .whl contains duplicate symbols, debug tables, and unused locale files. This trade-off is explicitly documented in the Vendoring PROJ and GDAL vs System Libraries architectural comparison, where portability is prioritized over artifact size.

A typical vendored spatial wheel breaks down roughly as follows — the database and native libs dominate:

Vendored PROJ/GDAL wheel size (approx. MB):

  • proj.db CRS database: 50
  • GDAL and PROJ shared libs: 60
  • Transitive deps (sqlite, tiff, curl): 25
  • Python extension module: 12
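A breakdown like the one above can be measured rather than estimated. The following is a minimal sketch using only the Python standard library; the category names and the matching rules (a member named proj.db, a directory ending in .libs, a .so suffix) are illustrative assumptions, not a convention defined by auditwheel or PROJ.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def categorize(member_name: str) -> str:
    """Assign a wheel member to a coarse, illustrative size category."""
    path = PurePosixPath(member_name)
    if path.name == "proj.db":
        return "proj.db"
    # auditwheel grafts libraries into a "<package>.libs/" directory.
    if any(part.endswith(".libs") for part in path.parts[:-1]):
        return "bundled shared libs"
    if path.suffix == ".so" or ".so." in path.name:
        return "extension modules"
    return "other"


def wheel_breakdown(wheel_path: str) -> dict[str, int]:
    """Return uncompressed bytes per category for one wheel file."""
    totals: dict[str, int] = defaultdict(int)
    with zipfile.ZipFile(wheel_path) as wheel:
        for info in wheel.infolist():
            if not info.is_dir():
                totals[categorize(info.filename)] += info.file_size
    return dict(totals)
```

Calling wheel_breakdown on a built wheel returns a dictionary of byte counts per category, which makes it easy to see whether the database or the native libraries dominate a given build.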

CI/CD Failure Modes & Exact Error-to-Fix Mapping

Wheel bloat manifests in CI/CD pipelines through predictable failure modes and latency spikes. The following matrix maps exact error signatures to production-grade remediation steps.

| Symptom | Exact Error Signature | Root Cause | Exact Fix |
| --- | --- | --- | --- |
| pip install timeout | ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. | Large wheels (>50MB) download slowly on constrained runners or corporate proxies and can exceed pip's default 15-second read timeout. | 1. Raise the limits: pip install --timeout 120 --retries 5. 2. Strip binaries: find . -name "*.so" -exec strip --strip-unneeded {} + 3. Strip at link time with LDFLAGS=-s (the .whl is already zip-compressed). |
| auditwheel failure | ERROR: auditwheel repair failed: Too many dependencies (libcurl.so.4, libtiff.so.6, libsqlite3.so.0) | Static linking is disabled, so every dynamic dependency is grafted into the wheel. | 1. Pass SONAMEs to auditwheel repair, e.g. --exclude libcurl.so.4 --exclude libtiff.so.6 --exclude libsqlite3.so.0, but only if the target system is guaranteed to provide them. 2. Verify CMAKE_ARGS="-DBUILD_SHARED_LIBS=OFF" in the build step. |
| Runtime CRS failure | CRSError: Invalid projection: +proj=longlat +datum=WGS84 +no_defs | proj.db stripped or PROJ_DATA path misconfigured in the vendored layout. | 1. Set PROJ_DATA=/path/to/wheel/share/proj in the runtime env. 2. Validate DB integrity: sqlite3 proj.db "SELECT count(*) FROM projected_crs;" (expect several thousand rows). |
| Lambda/Container OOM | MemoryError: Unable to allocate 128 MiB for shared object mapping | Unstripped .so with debug symbols + embedded SQLite cache mapping. | 1. Run strip -s on all .so files. 2. Set PROJ_NETWORK=OFF. 3. Disable the network grid chunk cache with cache_enabled = off in proj.ini. |
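The database integrity check from the matrix can also be scripted, which is convenient in CI jobs that already run Python. This is a hedged sketch using the standard sqlite3 module; the function name count_rows is hypothetical, and the table to query is left as a parameter because the right choice depends on the PROJ schema version you ship (projected_crs is one table present in PROJ's schema).

```python
import sqlite3


def count_rows(db_path: str, table: str) -> int:
    """Open a SQLite file read-only, verify it, and count rows in one table."""
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    # mode=ro avoids silently creating an empty database when the path is wrong.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        (count,) = conn.execute(f'SELECT count(*) FROM "{table}"').fetchone()
        # PRAGMA integrity_check yields a single row "ok" for a healthy file.
        (status,) = conn.execute("PRAGMA integrity_check").fetchone()
    finally:
        conn.close()
    if status != "ok":
        raise RuntimeError(f"integrity_check reported: {status}")
    return count
```

A build step could call count_rows("share/proj/proj.db", "projected_crs") and fail when the result falls below a threshold chosen for the pinned PROJ release. A missing file raises sqlite3.OperationalError instead of returning zero.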

Targeted Build Configuration

To enforce size constraints at the pipeline level, add these settings to the [tool.cibuildwheel] table in your pyproject.toml:

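[tool.cibuildwheel]
# Size-oriented compiler and linker flags. PROJ_NETWORK is a runtime switch,
# set here so build-time tests run without network grid access.
environment = { PROJ_NETWORK="OFF", CFLAGS="-Os -ffunction-sections -fdata-sections", LDFLAGS="-Wl,--gc-sections -s" }
# Trims the build container only; it does not change the contents of the wheel.
before-build = "rm -rf /opt/_internal/cpython-*/lib/python*/test"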

Pre-Flight Validation & Post-Build Auditing

Deploying oversized or malformed geospatial wheels breaks downstream reproducibility. Implement these validation gates in your CI workflow.

1. Dependency Boundary Verification

Run auditwheel show to confirm compliance with PEP 599/600 platform tags:

auditwheel show dist/*.whl

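After repair, the output should report no external shared libraries beyond those the manylinux policy allows (such as libc and libm). Any other external .so means a dependency was not grafted into the wheel’s .libs/ directory, which indicates a broken vendoring boundary.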

2. Symbol & Size Audit

Strip debug tables and verify payload distribution:

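# Remove DWARF debug info from the build tree (run before the wheel is packaged)
find . -name "*.so" -exec strip --strip-debug {} +

# List the largest members of each wheel, biggest first
# (the first row of each listing is the archive total)
for whl in dist/*.whl; do unzip -l "$whl" | sort -k1,1nr | head -n 20; done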

Target: proj.db ≤ 55MB, .so files ≤ 15MB, total wheel ≤ 80MB for x86_64.
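The targets above can be turned into an automated gate. The sketch below assumes decimal megabytes, measures proj.db and shared objects by their uncompressed size inside the archive, and measures the wheel by its size on disk; the function name budget_violations and the BUDGETS mapping are hypothetical helpers, not part of any packaging tool.

```python
import os
import zipfile

MB = 1_000_000  # decimal megabytes (an assumption about how the targets are meant)

# Budgets mirror the targets stated above; adjust them to your own thresholds.
BUDGETS = {"proj.db": 55 * MB, "shared objects": 15 * MB, "wheel": 80 * MB}


def budget_violations(wheel_path: str, budgets: dict[str, int] = BUDGETS) -> list[str]:
    """Return one message per exceeded budget; an empty list means the wheel passes."""
    db_bytes = 0
    so_bytes = 0
    with zipfile.ZipFile(wheel_path) as wheel:
        for info in wheel.infolist():
            name = info.filename.rsplit("/", 1)[-1]
            if name == "proj.db":
                db_bytes += info.file_size
            elif name.endswith(".so") or ".so." in name:
                so_bytes += info.file_size
    measured = {
        "proj.db": db_bytes,
        "shared objects": so_bytes,
        "wheel": os.path.getsize(wheel_path),
    }
    return [
        f"{key}: {measured[key]} bytes exceeds budget of {limit} bytes"
        for key, limit in budgets.items()
        if measured[key] > limit
    ]
```

A CI step can loop over every file in dist/, print the returned messages, and exit non-zero if any list is non-empty.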

3. Runtime Projection Smoke Test

Validate that the vendored CRS engine initializes without system fallback:

PROJ_DATA="" PROJ_LIB="" python -c "
from pyproj import CRS
crs = CRS.from_epsg(4326)
assert crs.to_proj4() == '+proj=longlat +datum=WGS84 +no_defs +type=crs'
print('Vendored PROJ runtime validated.')
"

Clearing PROJ_DATA and PROJ_LIB forces the extension to use only the bundled database. Failure indicates incorrect RPATH or missing share/proj directory.

PyPA Compliance & Spatial Data Constraints

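Vendoring PROJ intersects with strict packaging standards. PyPA’s manylinux specifications (PEP 599/600) allow a wheel to link externally only against a small allowlist of system libraries, such as glibc, that are guaranteed on compliant hosts. PROJ’s transitive dependencies (libcurl, libtiff, libssl) must therefore be statically linked or bundled into the wheel by auditwheel repair; passing auditwheel repair --exclude instead leaves them as system dependencies that the target host must provide.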

Spatial data licensing adds another constraint. The proj.db database contains EPSG authority definitions; PROJ itself is distributed under an MIT-style license, while the EPSG dataset carries its own terms of use. Stripping or modifying proj.db to reduce size can invalidate CRS lookups and transformations and may conflict with redistribution terms. Instead of editing proj.db, leave it intact, set PROJ_NETWORK=OFF, and omit optional legacy transformation grid files (conus, ntv2_0.gsb) that are rarely used in modern pipelines.

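When building for musllinux (Alpine-based containers), build inside a musllinux container and run auditwheel repair there; recent auditwheel releases support musllinux platform tags (PEP 656), whereas delocate targets macOS only. If libraries must be relocated manually, patchelf --set-rpath can point them at the wheel’s internal .libs/ directory. Binaries linked against musl libc are not ABI-compatible with glibc, so a musllinux wheel must never be retagged for, or installed on, glibc-based runners.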
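Because the libc family is encoded in the wheel's platform tag, a pipeline can check it before publishing or installing. The following is a small sketch that parses the filename with plain string handling; the function name libc_families is hypothetical, and the example filename in the usage note is made up for illustration.

```python
def libc_families(wheel_filename: str) -> set[str]:
    """Classify the platform tags of a wheel filename as glibc, musl, or other."""
    if not wheel_filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {wheel_filename!r}")
    # The platform tag is the last dash-separated field; compressed tag sets
    # join several platform tags with dots.
    platform_field = wheel_filename[: -len(".whl")].rsplit("-", 1)[-1]
    families = set()
    for tag in platform_field.split("."):
        if tag.startswith("manylinux"):
            families.add("glibc")
        elif tag.startswith("musllinux"):
            families.add("musl")
        else:
            families.add("other")
    return families
```

For a hypothetical file named example-1.0-cp311-cp311-musllinux_1_1_x86_64.whl the function returns {"musl"}, so a job running on a glibc-based runner can refuse to use that artifact.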

Strategic Mitigation for Production Pipelines

To maintain geospatial correctness without sacrificing CI velocity:

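  1. Split Wheels by Use Case: Publish a core distribution (minimal proj.db, no grids) and move the complete CRS database and grids into a separate data distribution that an extra pulls in, for example a hypothetical pip install pyproj[full]. An extra can only add dependencies; it cannot select a different wheel of the same package.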
  2. Cache proj.db Separately: Mount proj.db as a read-only volume in containerized deployments. Wheels then only ship the C extension (~12MB).
  3. Enforce Size Gates in CI: Add for w in dist/*.whl; do test "$(stat -c%s "$w")" -lt 85000000 || exit 1; done to post-build steps, so the check still works when dist/ holds several wheels. Fail the pipeline if thresholds are breached.
  4. Automate Symbol Pruning: Integrate objcopy --remove-section=.comment --remove-section=.note into the build stage to strip non-essential metadata.

Vendoring PROJ guarantees deterministic coordinate transformations across heterogeneous infrastructure, but the binary tax must be actively managed. By enforcing strict static linking, pruning transitive .so files, and validating proj.db integrity at build time, teams can maintain PyPA compliance while keeping container layers and serverless payloads within operational limits.