SeaBiscuit partitions PyTorch and ONNX models into blocks that fit the memory of available CPUs and GPUs. It schedules those blocks on under‑used hardware within a server and across servers, so inference stays on‑premises and no external GPUs are required.
Why SeaBiscuit?
Enterprise servers typically run at 12–18 % CPU utilisation, and many GPUs sit idle for much of the day. At the same time, large models are sent to the public cloud for inference. SeaBiscuit puts that unused capacity in the data‑centre to work and removes the need for external inference services.
How It Works
- Resource telemetry – an agent on each server reports FLOPS, memory, network bandwidth, and utilisation.
- Graph trace – the forward pass produces a directed acyclic graph (operators + tensor edges).
- Partitioning – a min‑cut solver groups operators into blocks that fit device memory while minimising cross‑device traffic (a simplified sketch follows this list).
- Placement – blocks are mapped to CPUs or GPUs with spare capacity, preferring same‑server or same‑rack links; plans are refreshed every few milliseconds (a cost‑model sketch also follows the list).
- Streaming runtime – tensors move over the fastest link available, and back‑pressure keeps queues bounded:
  • Intra‑server: NVLink or PCIe with NCCL;
  • Inter‑server: RDMA/RoCE over 25–100 GbE or InfiniBand, with optional compression.
- Migration – if a node becomes busy, a block is reassigned in under 50 ms; in‑flight requests continue without interruption.
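To make the partitioning step concrete, here is a minimal sketch. It is not SeaBiscuit's solver: it replaces the min‑cut formulation with a simple greedy pass over a topologically ordered operator DAG, but it shows the two quantities the real solver trades off, per‑block memory versus the bytes that cross block boundaries. The `Op` structure, function names, and sizes below are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    """One node of the traced forward-pass DAG."""
    name: str
    weight_bytes: int                                   # parameter memory this operator needs
    outputs_bytes: dict = field(default_factory=dict)   # consumer name -> activation size

def greedy_partition(ops, memory_cap):
    """Group operators into blocks that fit `memory_cap`.

    A toy stand-in for the min-cut solver: walk the DAG in topological
    order and keep an operator in the current block while it still fits,
    so tensors that stay inside a block never cross a device boundary.
    """
    blocks, current, used = [], [], 0
    for op in ops:
        if current and used + op.weight_bytes > memory_cap:
            blocks.append(current)        # close the block; its outgoing edges become cut edges
            current, used = [], 0
        current.append(op)
        used += op.weight_bytes
    if current:
        blocks.append(current)
    return blocks

def cut_bytes(blocks):
    """Bytes that must travel between devices: edges whose endpoints sit in different blocks."""
    owner = {op.name: i for i, block in enumerate(blocks) for op in block}
    return sum(size
               for block in blocks
               for op in block
               for consumer, size in op.outputs_bytes.items()
               if owner.get(consumer) != owner[op.name])

# Illustrative 4-operator chain with a 2 GiB per-device memory cap.
ops = [
    Op("embed", 1 << 30, {"attn": 64 << 20}),
    Op("attn",  1 << 30, {"mlp": 64 << 20}),
    Op("mlp",   1 << 30, {"head": 64 << 20}),
    Op("head",  1 << 29, {}),
]
blocks = greedy_partition(ops, memory_cap=2 << 30)
print(len(blocks), "blocks,", cut_bytes(blocks) >> 20, "MiB crossing devices")
```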
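Placement can be pictured the same way: estimate compute time from the telemetry‑reported throughput of each candidate device, add the time to move the block's inputs over the link to that device, and pick the cheapest device with enough free memory. The device records, numbers, and the additive cost model below are illustrative assumptions, not the production cost model.

```python
def place_block(block, devices):
    """Pick the device with the lowest estimated compute + transfer time.

    block:   dict with 'flops', 'mem_bytes', 'input_bytes'.
    devices: telemetry-style records with 'name', 'free_mem_bytes',
             'flops_per_s' (effective, after current load), and
             'link_bytes_per_s' (bandwidth from the upstream block to this device).
    """
    def est_seconds(dev):
        compute = block["flops"] / dev["flops_per_s"]
        transfer = block["input_bytes"] / dev["link_bytes_per_s"]
        return compute + transfer

    fits = [d for d in devices if d["free_mem_bytes"] >= block["mem_bytes"]]
    return min(fits, key=est_seconds) if fits else None

# Illustrative choice: a busy local GPU versus an idle GPU one server away.
devices = [
    {"name": "gpu0 (same server, busy)", "free_mem_bytes": 10 << 30,
     "flops_per_s": 20e12, "link_bytes_per_s": 25e9},    # PCIe-class link
    {"name": "gpu2 (next server, idle)", "free_mem_bytes": 14 << 30,
     "flops_per_s": 60e12, "link_bytes_per_s": 12.5e9},  # ~100 GbE link
]
block = {"flops": 2e12, "mem_bytes": 8 << 30, "input_bytes": 64 << 20}
print(place_block(block, devices)["name"])   # the remote GPU wins here despite the slower link
```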
Minimum Network for Cross‑Server Splits
SeaBiscuit assumes at least 25 GbE or HDR InfiniBand between servers when a single model spans hosts. If only 10 GbE is available, the planner keeps large tensors inside one server and uses cross‑server links only for small activations.
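A back‑of‑the‑envelope wire‑time calculation shows why the planner treats 10 GbE differently from faster links; the tensor sizes below are illustrative assumptions.

```python
def transfer_ms(tensor_bytes, link_gbit_per_s):
    """Ideal wire time for one tensor, ignoring protocol overhead."""
    return tensor_bytes * 8 / (link_gbit_per_s * 1e9) * 1e3

large_tensor = 512 << 20    # a large intermediate tensor, 512 MiB (assumed size)
small_activation = 8 << 20  # a small activation slice, 8 MiB (assumed size)

for gbe in (10, 25, 100):
    print(f"{gbe:>3} GbE: {transfer_ms(large_tensor, gbe):6.1f} ms large, "
          f"{transfer_ms(small_activation, gbe):5.2f} ms small")
```

At 10 GbE the large tensor alone costs hundreds of milliseconds on the wire, while the small activation stays in the low single digits, which is why only the latter is allowed to cross servers on that link.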
Example
A 70‑billion‑parameter model runs on four 16 GB GPUs and two CPUs across two adjacent servers connected by 100 GbE. The partitioner creates 18 blocks (14 GB maximum each). Inference latency stays within 1.3× of a single H100 while running entirely on existing hardware.
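A quick sizing check makes the example plausible; the FP16 precision is an assumption, since the example does not state one.

```python
params = 70e9                  # parameters in the example model
bytes_per_param = 2            # FP16 weights, an assumed precision
total_gb = params * bytes_per_param / 1e9     # 140 GB of weights
blocks = 18
avg_block_gb = total_gb / blocks              # ~7.8 GB per block on average
print(f"{total_gb:.0f} GB of weights, ~{avg_block_gb:.1f} GB per block "
      f"(cap 14 GB; four 16 GB GPUs plus host RAM on two CPUs)")
```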
Observed Results
| Metric | Typical change |
|---|---|
| Server utilisation | 3–5× increase |
| External GPU spend | Reduced or eliminated |
| P99 inference latency | 5–20× lower vs. cloud round‑trip |
| Data egress | Zero – data stays on‑site |
Feature Comparison
| Capability | Typical schedulers | SeaBiscuit |
|---|---|---|
| Automatic partitioning | Static or manual | Min‑cut at runtime |
| CPU/GPU utilisation | Manual pinning | Cost‑based placement |
| Network awareness | None | Bandwidth & latency in cost model |
| Live migration | Restart required | < 50 ms, online |
| Setup | Config files | Install agent + one CLI |
Outcomes
- Higher utilisation of existing hardware.
- Inference costs stay within the existing data‑centre budget.
- Dashboard with utilisation, latency, and transfer statistics.
- Support for PyTorch and ONNX models without code changes (see the export sketch below).
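As an illustration of the no‑code‑change workflow, a model can be handed over in its exported form. The export call below is standard PyTorch; the final CLI command is hypothetical, since this section does not document the actual command.

```python
import torch

# Any existing model can be handed over unmodified; exporting it to ONNX is one
# way to produce the graph that gets traced and partitioned (PyTorch models can
# also be traced directly, per the "Graph trace" step above).
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).eval()
example_input = torch.randn(1, 1024)
torch.onnx.export(model, example_input, "model.onnx")

# The exported file is then registered with the on-server agent via the CLI.
# (Command name is hypothetical; the real CLI is not documented in this section.)
#   $ seabiscuit deploy model.onnx
```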