PandaStack

Data pipelines

Run untrusted data transformations in disposable sandboxes — perfect for ETL, scraping, and one-off batch jobs.

Pattern: parallel batch transform

You have 10,000 documents to process. Each runs a different transformation. You don't want one bad input to take down the whole job.

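import json
from concurrent.futures import ThreadPoolExecutor

from pandastack import Sandbox

# load() and doc_ids come from your own code: load(doc_id) must return
# a JSON-serializable document, and doc_ids is the list of IDs to process.

def process_one(doc_id: str) -> dict:
    # One disposable sandbox per document, torn down when the block exits
    with Sandbox(template="python-data", cpu=1, memory_mb=512) as sb:
        sb.write("/tmp/input.json", json.dumps(load(doc_id)))
        result = sb.run("python /pipeline.py /tmp/input.json /tmp/output.json", timeout=60)
        if result.exit_code != 0:
            # A failed transform becomes an error record instead of aborting the batch
            return {"doc_id": doc_id, "error": result.stderr}
        return {"doc_id": doc_id, "output": json.loads(sb.read("/tmp/output.json"))}

# Up to 200 sandboxes in flight at once; results come back in input order
with ThreadPoolExecutor(max_workers=200) as ex:
    results = list(ex.map(process_one, doc_ids))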

A cluster of 5 c5n.metal agents handles 1,000 concurrent sandboxes comfortably.
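A non-zero exit code is one kind of failure; an exception raised inside the worker (for example, if a sandbox cannot be created) is another, and `ex.map` re-raises it when the results are collected. The sketch below shows one way to contain that with the standard library only. `run_batch` and `fake_process` are hypothetical names used for illustration and are not part of PandaStack; in real use you would pass your own per-document function.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def run_batch(process_one: Callable[[str], dict],
              doc_ids: Iterable[str],
              max_workers: int = 200) -> list:
    """Run process_one for every ID and turn raised exceptions into error records."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        # Map each future back to its document ID so failures can be attributed
        futures = {ex.submit(process_one, doc_id): doc_id for doc_id in doc_ids}
        for fut in as_completed(futures):
            doc_id = futures[fut]
            try:
                results.append(fut.result())
            except Exception as exc:
                # One failing document yields an error record; the rest keep running
                results.append({"doc_id": doc_id, "error": repr(exc)})
    return results


# Stand-in for the real per-document function (hypothetical, no sandbox involved)
def fake_process(doc_id: str) -> dict:
    if doc_id == "bad":
        raise RuntimeError("sandbox failed to start")
    return {"doc_id": doc_id, "output": doc_id.upper()}


results = run_batch(fake_process, ["a", "bad", "b"], max_workers=4)
```

Results arrive in completion order rather than input order, so key them by `doc_id` if order matters.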

Pattern: scraper with template

Pre-bake your scraper into a template so each sandbox boots ready-to-go:

cat > scraper/Dockerfile <<'EOF'
FROM ghcr.io/pandastack/templates:python-data
RUN pip install playwright requests beautifulsoup4 lxml
RUN playwright install chromium --with-deps
COPY scrape.py /scrape.py
EOF

pandastack templates build scraper ./scraper

Then:

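import shlex

with Sandbox(template="scraper", memory_mb=2048) as sb:
    # Quote the URL (assuming the command string is parsed by a shell) so "&" or "?" in it is not interpreted
    result = sb.run(f"python /scrape.py {shlex.quote(url)}", timeout=30)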

Each sandbox starts in under 500 ms because the browser is baked into the template, processes one URL, and is then destroyed.

Pattern: snapshot mid-pipeline for restart

For multi-hour jobs:

sb = Sandbox(template="python-data", memory_mb=4096)
sb.run("python /pipeline.py --stage extract --out /tmp/extracted.parquet")

# Snapshot before the expensive step
snap = sb.snapshot()
print(f"checkpoint: {snap.id}")

sb.run("python /pipeline.py --stage transform --in /tmp/extracted.parquet --out /tmp/transformed.parquet")

If the transform step crashes, you can restore from snap.id and retry the second half without redoing extract.
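The retry logic around a checkpoint can be sketched independently of the SDK. The text does not show the call that restores a snapshot, so the sketch below takes it as an injected callable: `retry_from_checkpoint`, `run_stage`, and `restore` are hypothetical names, and the fake stage here only simulates a crash. In real use, `run_stage` would run the transform command and raise when `exit_code` is non-zero, and `restore` would wrap whatever restore operation your PandaStack version provides.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_from_checkpoint(run_stage: Callable[[], T],
                          restore: Callable[[], None],
                          max_attempts: int = 3) -> T:
    """Run a stage; on failure, roll back to the checkpoint and try again."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return run_stage()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                # Return to the state captured before the expensive step
                restore()
    raise RuntimeError(f"stage failed after {max_attempts} attempts") from last_error


# Simulated stage that crashes twice, then succeeds (illustration only)
calls = {"stage": 0, "restore": 0}


def flaky_transform() -> str:
    calls["stage"] += 1
    if calls["stage"] < 3:
        raise RuntimeError("transform crashed")
    return "/tmp/transformed.parquet"


def fake_restore() -> None:
    calls["restore"] += 1


out = retry_from_checkpoint(flaky_transform, fake_restore)
```

Keeping the restore step injectable means the extract stage runs once, and only the stage after the checkpoint is repeated.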

Cost notes

Sandboxes are billed per wall-clock second of runtime plus RAM-seconds. A typical scraper sandbox running for 5 seconds with 512 MiB costs ~$0.00006 on managed PandaStack.
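For a rough budget, the quoted figure can be scaled to a whole batch. This is a back-of-the-envelope sketch: it assumes cost grows linearly with runtime at a fixed 512 MiB and that there is no minimum charge or rounding, none of which the figure above confirms. `estimate_cost_usd` is a hypothetical helper, not part of any PandaStack tooling.

```python
# Reference point from the text: 5 seconds at 512 MiB costs about $0.00006
REFERENCE_COST_USD = 0.00006
REFERENCE_SECONDS = 5


def estimate_cost_usd(num_sandboxes: int, seconds_each: float) -> float:
    """Estimate batch cost for 512 MiB sandboxes, assuming linear scaling with time."""
    per_second = REFERENCE_COST_USD / REFERENCE_SECONDS
    return num_sandboxes * seconds_each * per_second


# 10,000 scraper runs of 5 seconds each
batch_cost = estimate_cost_usd(10_000, 5)
print(f"${batch_cost:.2f}")  # prints $0.60
```

Sandboxes with more memory accrue more RAM-seconds, so this estimate does not carry over to other memory sizes without the actual rate card.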
