Why Vendoring PROJ Causes Wheel Bloat: Architecture, CI/CD Impact, and Mitigation
Understanding why vendoring PROJ causes wheel bloat requires dissecting how modern Python geospatial wheels package coordinate transformation engines under manylinux and musllinux constraints. PROJ is not a lightweight math library; it is a full geodetic database runtime. When maintainers vendor PROJ into wheels (e.g., pyproj, rasterio, fiona), the build pipeline bundles the core C library (statically linked or copied in as a shared object), embeds the `proj.db` SQLite CRS database, and pulls in transitive dependencies such as libsqlite3, libtiff, libcurl, and libssl. This architecture removes any runtime dependency on system libraries, but it can inflate wheel sizes from ~5MB to 80–200MB per platform tag, directly affecting CI artifact storage, container image layers, and serverless cold-start latency.
The Portability Tax: Static Linking & Database Payload
The primary driver of wheel inflation cited here is the `proj.db` file (~45–55MB uncompressed; the real figure depends on the PROJ release, so measure it in your own build). Under the Geospatial C-Extension Fundamentals & ABI Architecture model, wheels must ship self-contained binaries to avoid `ImportError: libproj.so.XX: cannot open shared object file` on target runners. To achieve this, the build links PROJ and GDAL, and then auditwheel (Linux) or delocate (macOS) copies the shared .so/.dylib files into the wheel. The package build ships `proj.db` alongside them, so imports resolve entirely against bundled, version-pinned binaries.
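Additionally, PROJ’s CMake build enables several optional features by default, and vendored wheels usually carry the matching data and libraries:
- `PROJ_DATA` directory bundling (CRS grids, transformation pipelines, EPSG authority tables)
- `libcurl` for remote grid fetching (CMake option `ENABLE_CURL`; switched on at runtime with `PROJ_NETWORK=ON`)
- `libtiff` for reading GeoTIFF-format transformation grids (CMake option `ENABLE_TIFF`)
- `libsqlite3` (vendored or system) for `proj.db` I/O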
When the wheel is passed through `auditwheel repair`, the tool patches RPATH entries and copies every shared library that is not on the manylinux allow-list into the wheel’s `.libs/` directory. If the build environment lacks strict dependency pruning, the final .whl can contain duplicate symbols, debug tables, and unused locale files. This trade-off is explicitly documented in the Vendoring PROJ and GDAL vs System Libraries architectural comparison, where portability is prioritized over artifact size.
In a typical vendored spatial wheel, the database and the native libraries dominate the payload.
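One way to measure that breakdown for a specific wheel is to read the archive directly, since a .whl file is a zip archive. The sketch below is illustrative only: the function name `payload_breakdown` and the four category labels are assumptions made for this example, and the path patterns assume a layout where PROJ data lives under a `share/proj/` directory.

```python
import zipfile
from collections import defaultdict


def payload_breakdown(wheel_path):
    """Return uncompressed bytes per category for one wheel (a zip archive).

    The categories are hypothetical labels chosen for this sketch.
    """
    totals = defaultdict(int)
    with zipfile.ZipFile(wheel_path) as wheel:
        for info in wheel.infolist():
            name = info.filename
            base = name.rsplit("/", 1)[-1]
            if base == "proj.db":
                category = "proj.db"
            elif base.endswith((".so", ".dylib")) or ".so." in base:
                # Covers extension modules and versioned libraries (libfoo.so.1).
                category = "native libraries"
            elif "/share/proj/" in name or name.startswith("share/proj/"):
                category = "other PROJ data"
            else:
                category = "python and metadata"
            # file_size is the uncompressed size recorded in the zip directory.
            totals[category] += info.file_size
    return dict(totals)
```

Calling `payload_breakdown("dist/<your wheel>.whl")` returns a dictionary such as `{"proj.db": ..., "native libraries": ...}` that can be printed or compared against a size budget.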
CI/CD Failure Modes & Exact Error-to-Fix Mapping
Wheel bloat manifests in CI/CD pipelines through predictable failure modes and latency spikes. The following matrix maps exact error signatures to production-grade remediation steps.
| Symptom | Exact Error Signature | Root Cause | Exact Fix |
|---|---|---|---|
| `pip install` timeout | `ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.` | A large wheel pulled through a slow runner or corporate proxy exceeds pip's default 15-second socket timeout. | (1) Raise the timeout with `pip install --timeout 120` and keep `--prefer-binary` so CI never falls back to a source build. (2) Strip binaries before packaging: `find . -name "*.so" -exec strip --strip-unneeded {} +`. (3) Strip at link time with `LDFLAGS=-s` (the .whl is already zip-compressed, so recompressing it gains little). |
| auditwheel failure | `ERROR: auditwheel repair failed: Too many dependencies (libcurl.so.4, libtiff.so.6, libsqlite3.so.0)` | Static linking is disabled, so the extension links dynamically against libraries outside the manylinux allow-list. | (1) Pass `--exclude libcurl.so.4 --exclude libtiff.so.6 --exclude libsqlite3.so.0` to `auditwheel repair` (the option takes the library SONAME, and excluded libraries must then exist on the target system). (2) Verify `CMAKE_ARGS="-DBUILD_SHARED_LIBS=OFF"` in the build step. |
| Runtime CRS failure | `CRSError: Invalid projection: +proj=longlat +datum=WGS84 +no_defs` | `proj.db` stripped or `PROJ_DATA` path misconfigured in the vendored layout. | (1) Set `PROJ_DATA=/path/to/wheel/share/proj` in the runtime env (`PROJ_LIB` on PROJ releases before 9.1). (2) Validate DB integrity: `sqlite3 proj.db "PRAGMA integrity_check;"` should print `ok`, and `sqlite3 proj.db "SELECT count(*) FROM projected_crs;"` should return several thousand rows. |
| Lambda/Container OOM | `MemoryError: Unable to allocate 128 MiB for shared object mapping` | Unstripped .so files with debug symbols, plus embedded SQLite cache mapping. | (1) Run `strip -s` on all .so files. (2) Disable network access with `PROJ_NETWORK=OFF`. (3) If networking must stay on, turn off the downloaded-grid cache with `cache_enabled = off` in `proj.ini`. |
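The database integrity check from the runtime CRS row can also run from Python, which is convenient inside a CI test step. This is a minimal sketch, not pyproj's own validation: `check_proj_db` is a hypothetical helper, and the `projected_crs` table name and the `min_rows` threshold are assumptions to adjust for the PROJ release being shipped.

```python
import pathlib
import sqlite3


def check_proj_db(db_path, table="projected_crs", min_rows=1000):
    """Run SQLite's integrity check plus a row-count sanity check.

    `table` and `min_rows` are illustrative assumptions, not PROJ constants.
    Returns a (passed, message) tuple.
    """
    # Table names cannot be bound as SQL parameters, so validate before formatting.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    # Open read-only so the check can never modify the bundled database.
    uri = pathlib.Path(db_path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.DatabaseError as exc:
        return False, f"cannot open database: {exc}"
    try:
        status = conn.execute("PRAGMA integrity_check").fetchone()[0]
        if status != "ok":
            return False, f"integrity_check reported: {status}"
        rows = conn.execute(f"SELECT count(*) FROM {table}").fetchone()[0]
        if rows < min_rows:
            return False, f"{table} has only {rows} rows"
        return True, f"{table} has {rows} rows"
    except sqlite3.DatabaseError as exc:
        # Raised for a missing table or a file that is not a valid database.
        return False, f"query failed: {exc}"
    finally:
        conn.close()
```

Returning a tuple rather than raising keeps the helper easy to wire into a CI step that prints the message and exits non-zero on failure.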
Targeted Build Configuration
To enforce size constraints at the pipeline level, add these settings to the `[tool.cibuildwheel]` table in your pyproject.toml:
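[tool.cibuildwheel]
# Size-oriented compiler and linker flags; -s strips symbols at link time.
environment = { PROJ_NETWORK="OFF", CFLAGS="-Os -ffunction-sections -fdata-sections", LDFLAGS="-Wl,--gc-sections -s" }
# Runs inside the build container before each build; it trims the container's Python test suite.
before-build = "rm -rf /opt/_internal/cpython-*/lib/python*/test"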
Pre-Flight Validation & Post-Build Auditing
Deploying oversized or malformed geospatial wheels breaks downstream reproducibility. Implement these validation gates in your CI workflow.
1. Dependency Boundary Verification
Run auditwheel show to confirm compliance with PEP 599/600 platform tags:
auditwheel show dist/*.whl
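On a repaired wheel, the output should report a manylinux or musllinux platform tag and list only allow-listed system libraries (such as libc) as external dependencies. A PROJ, SQLite, or curl library that shows up as an external reference instead of sitting under the wheel’s `.libs/` directory indicates a broken vendoring boundary.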
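The same boundary can be inspected from the archive side by listing what actually landed in the wheel's `.libs/` directory. The sketch below assumes bundled libraries sit in a directory whose name ends in `.libs`; the helper names and the default `allowed_prefixes` are hypothetical and should be adapted to the project.

```python
import zipfile


def bundled_libraries(wheel_path):
    """Return archive members stored under any directory ending in `.libs`."""
    found = []
    with zipfile.ZipFile(wheel_path) as wheel:
        for name in wheel.namelist():
            parts = name.split("/")
            in_libs_dir = any(part.endswith(".libs") for part in parts[:-1])
            # parts[-1] is empty for directory entries, which are skipped.
            if in_libs_dir and parts[-1]:
                found.append(name)
    return sorted(found)


def unexpected_libraries(wheel_path, allowed_prefixes=("libproj", "libsqlite3")):
    """Return bundled libraries whose file name matches no allowed prefix.

    `allowed_prefixes` is an assumption for this sketch; repair tools may
    append a hash to the file name, so only the prefix is compared.
    """
    return [
        name
        for name in bundled_libraries(wheel_path)
        if not name.rsplit("/", 1)[-1].startswith(tuple(allowed_prefixes))
    ]
```

A non-empty result from `unexpected_libraries` is a prompt to check whether that dependency should be disabled at build time, linked statically, or accepted and added to the allowed list.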
2. Symbol & Size Audit
Strip debug tables and verify payload distribution:
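# Remove DWARF debug info from built libraries (run before the wheel is packed or repaired)
find . -name "*.so" -exec strip --strip-debug {} +
# List the 20 largest files in each wheel (unzip accepts one archive per call)
for w in dist/*.whl; do unzip -l "$w" | sort -rn | head -n 20; done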
Target: proj.db ≤ 55MB, .so files ≤ 15MB, total wheel ≤ 80MB for x86_64.
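These targets are easier to enforce when they live in one place as data. The following sketch compares measured sizes against the budgets stated above; `budget_violations`, the dictionary keys, and the decimal definition of a megabyte are assumptions made for illustration.

```python
MB = 1_000_000  # decimal megabytes; an assumption, switch to 1024 * 1024 if preferred

# Budgets taken from the targets in the text, expressed in bytes.
DEFAULT_BUDGETS = {
    "proj.db": 55 * MB,
    "native libraries": 15 * MB,
    "total": 80 * MB,
}


def budget_violations(sizes, budgets=DEFAULT_BUDGETS):
    """Compare measured sizes (bytes) with budgets and describe each violation."""
    violations = []
    for key, limit in budgets.items():
        measured = sizes.get(key)
        if measured is None:
            # A missing measurement is reported rather than silently passed.
            violations.append(f"{key}: no measurement supplied")
        elif measured > limit:
            violations.append(
                f"{key}: {measured / MB:.1f} MB exceeds {limit / MB:.0f} MB"
            )
    return violations
```

An empty list means every budget is met; a CI step can print the returned strings and exit non-zero otherwise.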
3. Runtime Projection Smoke Test
Validate that the vendored CRS engine initializes without system fallback:
PROJ_DATA="" PROJ_LIB="" python -c "
from pyproj import CRS
crs = CRS.from_epsg(4326)
assert crs.to_epsg() == 4326
print('Vendored PROJ runtime validated.')
"
Clearing PROJ_DATA and PROJ_LIB forces the extension to use only the bundled database. Failure indicates incorrect RPATH or missing share/proj directory.
PyPA Compliance & Spatial Data Constraints
Vendoring PROJ intersects with strict packaging standards. PyPA’s manylinux specification permits external linkage only against a short allow-list of system libraries (glibc and a few others) that every compliant platform is expected to provide. PROJ’s transitive dependencies (libcurl, libtiff, libssl) are not on that list, so they must be statically linked, bundled into the wheel, or explicitly excluded via `auditwheel repair --exclude`.
Spatial data licensing adds another constraint. PROJ itself is released under an MIT-style license, while the EPSG registry definitions stored in `proj.db` carry their own terms of use. Stripping or modifying `proj.db` to reduce size can invalidate CRS transformations and may conflict with those redistribution terms. Instead of editing the database, set `PROJ_NETWORK=OFF` and, if the wheel bundles optional grid files, drop only legacy transformation grids (conus, ntv2_0.gsb) that are rarely used in modern pipelines.
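When building for musllinux (Alpine-based containers), build inside a musllinux image and repair the wheel with auditwheel, which handles musllinux platform tags (PEP 656); delocate applies only to macOS. If a library has to be relocated by hand, `patchelf --set-rpath` can point it at the wheel’s internal `.libs/` directory. A wheel linked against musl libc will not load on glibc-based runners, so keep the two build matrices separate and confirm that each wheel carries the correct platform tag.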
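A quick guard against mixing the two libc families is to read the platform tag straight from the wheel filename. This sketch relies only on the standard wheel naming layout (name, version, optional build tag, Python tag, ABI tag, platform tag); the function names are hypothetical, and the example filenames in the usage note are made up.

```python
def platform_tags(wheel_filename):
    """Split the platform field of a wheel filename into individual tags."""
    stem = wheel_filename[: -len(".whl")] if wheel_filename.endswith(".whl") else wheel_filename
    # Layout: name-version[-build]-python-abi-platform; the platform field is last.
    platform = stem.split("-")[-1]
    # Several platform tags may be compressed into one field, separated by dots.
    return platform.split(".")


def libc_flavor(wheel_filename):
    """Classify a wheel as 'musl', 'glibc', or 'other' from its platform tags."""
    tags = platform_tags(wheel_filename)
    if any(tag.startswith("musllinux_") for tag in tags):
        return "musl"
    if any(tag.startswith("manylinux") for tag in tags):
        return "glibc"
    return "other"
```

For instance, `libc_flavor("example_pkg-1.0-cp311-cp311-musllinux_1_1_x86_64.whl")` returns `"musl"`, so a publish step could refuse to upload a musl wheel from the glibc job and the reverse.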
Strategic Mitigation for Production Pipelines
To maintain geospatial correctness without sacrificing CI velocity:
- Split Wheels by Use Case: Publish core wheels (minimal `proj.db`, no grids) and full wheels (complete CRS database), for example through an extra in the style of `pip install pyproj[full]` that pulls in a separate data package.
- Cache `proj.db` Separately: Mount `proj.db` as a read-only volume in containerized deployments and point `PROJ_DATA` at it. Wheels then only ship the C extension (~12MB).
- Enforce Size Gates in CI: Add `for w in dist/*.whl; do test "$(stat -c%s "$w")" -lt 85000000 || exit 1; done` to post-build steps. Fail the pipeline if thresholds are breached.
- Automate Symbol Pruning: Integrate `objcopy --remove-section=.comment --remove-section=.note` into the build stage to strip non-essential metadata.
Vendoring PROJ guarantees deterministic coordinate transformations across heterogeneous infrastructure, but the binary tax must be actively managed. By enforcing strict static linking, pruning transitive .so files, and validating proj.db integrity at build time, teams can maintain PyPA compliance while keeping container layers and serverless payloads within operational limits.