SeaBiscuit

Inference on Idle Hardware

SeaBiscuit partitions PyTorch and ONNX models into blocks that fit the memory of available CPUs and GPUs. It schedules those blocks on under‑used hardware within a server and across servers, so inference stays on‑premises and no external GPUs are required.
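To make the partitioning idea concrete, here is a minimal sketch, not SeaBiscuit's actual API: it walks a PyTorch model's top‑level submodules, sums their parameter footprints, and closes a block whenever adding the next submodule would exceed a per‑device memory budget. The function name, the greedy strategy, and the weights‑only accounting are assumptions for illustration.

```python
import torch
import torch.nn as nn

def greedy_partition(model: nn.Module, memory_budget_bytes: int):
    """Illustrative only: split a model's top-level children into blocks
    whose parameter footprint stays under a per-device memory budget."""
    blocks, current, current_bytes = [], [], 0
    for name, module in model.named_children():
        # Parameter footprint of this submodule (weights only, no activations).
        module_bytes = sum(p.numel() * p.element_size() for p in module.parameters())
        if current and current_bytes + module_bytes > memory_budget_bytes:
            blocks.append(current)          # close the block before it overflows
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += module_bytes
    if current:
        blocks.append(current)
    return blocks

# Example: partition a toy MLP into blocks that fit in roughly 1 MB each.
toy = nn.Sequential(*[nn.Linear(256, 256) for _ in range(8)])
print(greedy_partition(toy, memory_budget_bytes=1_000_000))
```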

Why SeaBiscuit?

Enterprise servers typically run at 12–18 % CPU utilisation, and attached GPUs often sit idle. At the same time, large models are sent to the public cloud for inference. SeaBiscuit puts that unused capacity in the data‑centre to work and removes the need for external inference services.

How It Works

Minimum network for cross‑server splits

SeaBiscuit assumes at least 25 GbE or HDR InfiniBand between servers when a single model spans hosts. If only 10 GbE is available, the planner keeps large tensors inside one server and uses cross‑server links only for small activations.
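As a rough illustration of that rule, the sketch below estimates how long a tensor takes to cross the inter‑server link and only permits a cross‑server cut when the transfer fits a latency budget. The function names, the 1 ms budget, and the back‑of‑the‑envelope bandwidth model are assumptions for illustration, not the actual planner logic.

```python
def transfer_time_ms(tensor_bytes: int, link_gbit_per_s: float) -> float:
    """Time to move one tensor over the inter-server link (ignores protocol overhead)."""
    return tensor_bytes * 8 / (link_gbit_per_s * 1e9) * 1e3

def allow_cross_server_cut(tensor_bytes: int, link_gbit_per_s: float,
                           budget_ms: float = 1.0) -> bool:
    """Permit a cut across servers only if the transfer fits the latency budget."""
    return transfer_time_ms(tensor_bytes, link_gbit_per_s) <= budget_ms

# A small FP16 activation (8 tokens x 4096 dims, ~64 KB) crosses easily even on 10 GbE;
# a 2 GB tensor would not, so it stays inside one server.
small_activation = 8 * 4096 * 2          # bytes
large_tensor = 2 * 1024**3               # bytes
print(allow_cross_server_cut(small_activation, link_gbit_per_s=10))   # True
print(allow_cross_server_cut(large_tensor, link_gbit_per_s=10))       # False
```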

Example

A 70‑billion‑parameter model runs on four 16 GB GPUs and two CPUs spread across two adjacent servers connected by 100 GbE. The partitioner creates 18 blocks of at most 14 GB each. Inference latency stays within 1.3× that of a single H100 while using only existing hardware.
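As a sanity check on those numbers, assuming FP16 weights (2 bytes per parameter, and ignoring KV cache), the model needs roughly 140 GB for weights alone, so the 18 blocks hold about 7.8 GB of weights each on average, and the 14 GB per‑block cap leaves at least 2 GB of headroom on every 16 GB GPU for activations and runtime buffers:

```python
params = 70e9                                  # 70 B parameters
bytes_per_param = 2                            # FP16 weights (assumption)
weights_gb = params * bytes_per_param / 1e9    # ~140 GB of weights in total

blocks = 18
avg_block_gb = weights_gb / blocks             # ~7.8 GB of weights per block
gpu_mem_gb, block_cap_gb = 16, 14
headroom_gb = gpu_mem_gb - block_cap_gb        # >= 2 GB left per GPU for activations

print(f"{weights_gb:.0f} GB total, {avg_block_gb:.1f} GB/block, {headroom_gb} GB GPU headroom")
```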

Observed Results

| Metric | Typical change |
| --- | --- |
| Server utilisation | 3–5× increase |
| External GPU spend | Reduced or eliminated |
| P99 inference latency | 5–20× lower vs. cloud round‑trip |
| Data egress | Zero – data stays on‑site |

Feature Comparison

| Capability | Typical schedulers | SeaBiscuit |
| --- | --- | --- |
| Automatic partitioning | Static or manual | Min‑cut at runtime |
| CPU/GPU utilisation | Manual pinning | Cost‑based placement |
| Network awareness | None | Bandwidth & latency in cost model |
| Live migration | Restart required | <50 ms online |
| Setup | Config files | Install agent + one CLI |
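The cost‑based placement and network‑awareness rows above can be pictured with a toy cost model: each candidate device is scored by the compute time for a block plus the time to move the block's input activations over the link to that device, and the block lands on the cheapest device. The device figures, function names, and units below are illustrative assumptions, not SeaBiscuit's actual cost model.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    tflops: float            # sustained compute throughput (TFLOP/s)
    link_gbit_per_s: float   # bandwidth to the device holding the previous block

def placement_cost_ms(block_gflops: float, activation_bytes: int, dev: Device) -> float:
    """Toy cost: compute time for the block plus time to receive its input activations."""
    compute_ms = block_gflops / (dev.tflops * 1e3) * 1e3
    transfer_ms = activation_bytes * 8 / (dev.link_gbit_per_s * 1e9) * 1e3
    return compute_ms + transfer_ms

devices = [
    Device("gpu0 (same server)", tflops=60, link_gbit_per_s=200),    # fast local link
    Device("gpu2 (other server)", tflops=60, link_gbit_per_s=100),   # 100 GbE
    Device("cpu1 (same server)", tflops=2, link_gbit_per_s=200),
]

block_gflops, activation_bytes = 4_000, 8 * 4096 * 2
best = min(devices, key=lambda d: placement_cost_ms(block_gflops, activation_bytes, d))
print("place block on:", best.name)
```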

Outputs