SeaBiscuit partitions PyTorch and ONNX models into blocks that fit the memory of available CPUs and GPUs. It schedules those blocks on under‑used hardware within a server and across servers, so inference stays on‑premises and no external GPUs are required.
Why SeaBiscuit?
Enterprise servers typically run at 12–18 % CPU utilisation, and many GPUs sit idle for much of the day. At the same time, large models are sent to the public cloud for inference. SeaBiscuit puts that unused capacity in the data‑centre to work and removes the need for external inference services.
How It Works
- Resource telemetry – an agent on each server reports FLOPS, memory, network bandwidth, and utilisation.
- Graph trace – the forward pass produces a directed acyclic graph (operators + tensor edges).
- Partitioning – a min‑cut solver groups operators into blocks that fit device memory while minimising cross‑device traffic (a simplified sketch follows this list).
- Placement – blocks are mapped to CPUs or GPUs with spare capacity, preferring same‑server or same‑rack links; plans are refreshed every few milliseconds (a cost‑model sketch also follows the list).
- Streaming runtime – tensors move over the fastest link available, and back‑pressure keeps queues bounded:
  • Intra‑server: NVLink or PCIe with NCCL;
  • Inter‑server: RDMA/RoCE over 25–100 GbE or InfiniBand, with optional compression.
- Migration – if a node becomes busy, a block is reassigned in under 50 ms; in‑flight requests continue without interruption.
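To make the partitioning step concrete, here is a minimal sketch. It is not SeaBiscuit's solver: it replaces the min‑cut formulation with a simple greedy pass over a topologically ordered operator DAG, but it shows the two quantities the real solver trades off, per‑block memory versus the bytes that cross block boundaries. The `Op` structure, function names, and sizes below are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    """One node of the traced forward-pass DAG."""
    name: str
    weight_bytes: int                                   # parameter memory this operator needs
    outputs_bytes: dict = field(default_factory=dict)   # consumer name -> activation size

def greedy_partition(ops, memory_cap):
    """Group operators into blocks that fit `memory_cap`.

    A toy stand-in for the min-cut solver: walk the DAG in topological
    order and keep an operator in the current block while it still fits,
    so tensors that stay inside a block never cross a device boundary.
    """
    blocks, current, used = [], [], 0
    for op in ops:
        if current and used + op.weight_bytes > memory_cap:
            blocks.append(current)        # close the block; its outgoing edges become cut edges
            current, used = [], 0
        current.append(op)
        used += op.weight_bytes
    if current:
        blocks.append(current)
    return blocks

def cut_bytes(blocks):
    """Bytes that must travel between devices: edges whose endpoints sit in different blocks."""
    owner = {op.name: i for i, block in enumerate(blocks) for op in block}
    return sum(size
               for block in blocks
               for op in block
               for consumer, size in op.outputs_bytes.items()
               if owner.get(consumer) != owner[op.name])

# Illustrative 4-operator chain with a 2 GiB per-device memory cap.
ops = [
    Op("embed", 1 << 30, {"attn": 64 << 20}),
    Op("attn",  1 << 30, {"mlp": 64 << 20}),
    Op("mlp",   1 << 30, {"head": 64 << 20}),
    Op("head",  1 << 29, {}),
]
blocks = greedy_partition(ops, memory_cap=2 << 30)
print(len(blocks), "blocks,", cut_bytes(blocks) >> 20, "MiB crossing devices")
```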
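Placement can be pictured the same way: estimate compute time from the telemetry‑reported throughput of each candidate device, add the time to move the block's inputs over the link to that device, and pick the cheapest device with enough free memory. The device records, numbers, and the additive cost model below are illustrative assumptions, not the production cost model.

```python
def place_block(block, devices):
    """Pick the device with the lowest estimated compute + transfer time.

    block:   dict with 'flops', 'mem_bytes', 'input_bytes'.
    devices: telemetry-style records with 'name', 'free_mem_bytes',
             'flops_per_s' (effective, after current load), and
             'link_bytes_per_s' (bandwidth from the upstream block to this device).
    """
    def est_seconds(dev):
        compute = block["flops"] / dev["flops_per_s"]
        transfer = block["input_bytes"] / dev["link_bytes_per_s"]
        return compute + transfer

    fits = [d for d in devices if d["free_mem_bytes"] >= block["mem_bytes"]]
    return min(fits, key=est_seconds) if fits else None

# Illustrative choice: a busy local GPU versus an idle GPU one server away.
devices = [
    {"name": "gpu0 (same server, busy)", "free_mem_bytes": 10 << 30,
     "flops_per_s": 20e12, "link_bytes_per_s": 25e9},    # PCIe-class link
    {"name": "gpu2 (next server, idle)", "free_mem_bytes": 14 << 30,
     "flops_per_s": 60e12, "link_bytes_per_s": 12.5e9},  # ~100 GbE link
]
block = {"flops": 2e12, "mem_bytes": 8 << 30, "input_bytes": 64 << 20}
print(place_block(block, devices)["name"])   # the remote GPU wins here despite the slower link
```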
Minimum Network for Cross‑Server Splits
SeaBiscuit assumes at least 25 GbE or HDR InfiniBand between servers when a single model spans hosts. If only 10 GbE is available, the planner keeps large tensors inside one server and uses cross‑server links only for small activations.
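A back‑of‑the‑envelope wire‑time calculation shows why the planner treats 10 GbE differently from faster links; the tensor sizes below are illustrative assumptions.

```python
def transfer_ms(tensor_bytes, link_gbit_per_s):
    """Ideal wire time for one tensor, ignoring protocol overhead."""
    return tensor_bytes * 8 / (link_gbit_per_s * 1e9) * 1e3

large_tensor = 512 << 20    # a large intermediate tensor, 512 MiB (assumed size)
small_activation = 8 << 20  # a small activation slice, 8 MiB (assumed size)

for gbe in (10, 25, 100):
    print(f"{gbe:>3} GbE: {transfer_ms(large_tensor, gbe):6.1f} ms large, "
          f"{transfer_ms(small_activation, gbe):5.2f} ms small")
```

At 10 GbE the large tensor alone costs hundreds of milliseconds on the wire, while the small activation stays in the low single digits, which is why only the latter is allowed to cross servers on that link.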
Example
A 70‑billion‑parameter model runs on four 16 GB GPUs and two CPUs across two adjacent servers connected by 100 GbE. The partitioner creates 18 blocks (14 GB maximum each). Inference latency stays within 1.3× of a single H100 while running entirely on existing hardware.
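A quick sizing check makes the example plausible; the FP16 precision is an assumption, since the example does not state one.

```python
params = 70e9                  # parameters in the example model
bytes_per_param = 2            # FP16 weights, an assumed precision
total_gb = params * bytes_per_param / 1e9     # 140 GB of weights
blocks = 18
avg_block_gb = total_gb / blocks              # ~7.8 GB per block on average
print(f"{total_gb:.0f} GB of weights, ~{avg_block_gb:.1f} GB per block "
      f"(cap 14 GB; four 16 GB GPUs plus host RAM on two CPUs)")
```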
Observed Results
| Metric | Typical change |
|---|---|
| Server utilisation | 3–5× increase |
| External GPU spend | Reduced or eliminated |
| P99 inference latency | 5–20× lower vs. cloud round‑trip |
| Data egress | Zero – data stays on‑site |
Feature Comparison
| Capability | Typical schedulers | SeaBiscuit |
|---|---|---|
| Automatic partitioning | Static or manual | Min‑cut at runtime |
| CPU/GPU utilisation | Manual pinning | Cost‑based placement |
| Network awareness | None | Bandwidth & latency in cost model |
| Live migration | Restart required | < 50 ms, online |
| Setup | Config files | Install agent + one CLI |
Outcomes
- Higher utilisation of existing hardware.
- Inference costs stay within the existing data‑centre budget.
- Dashboard with utilisation, latency, and transfer statistics.
- Support for PyTorch and ONNX models without code changes (see the export sketch below).
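As an illustration of the no‑code‑change workflow, a model can be handed over in its exported form. The export call below is standard PyTorch; the final CLI command is hypothetical, since this section does not document the actual command.

```python
import torch

# Any existing model can be handed over unmodified; exporting it to ONNX is one
# way to produce the graph that gets traced and partitioned (PyTorch models can
# also be traced directly, per the "Graph trace" step above).
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).eval()
example_input = torch.randn(1, 1024)
torch.onnx.export(model, example_input, "model.onnx")

# The exported file is then registered with the on-server agent via the CLI.
# (Command name is hypothetical; the real CLI is not documented in this section.)
#   $ seabiscuit deploy model.onnx
```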