# How Fluid.sh Sandboxes Work
## Intro
When you ask Fluid to spin up a new sandbox, you aren’t waiting for a full OS installation. Instead, we use a Linked Clone mechanism that provisions a fresh, isolated environment in milliseconds. Here is a deep dive into how it works, why it’s safe, and what’s next.
## The Mechanism: Linked Clones & Overlays
At the heart of Fluid’s cloning engine is the Copy-On-Write (COW) strategy.
- The Golden Image: We start with a “base” or “golden” image (e.g., a standard Ubuntu cloud image). This file remains read-only and untouched.
- The Overlay (QCOW2): When you request a new environment, we don’t copy that massive base image. Instead, we create a tiny “overlay” file using `qemu-img`:

```bash
qemu-img create -f qcow2 -F qcow2 -b /path/to/base.img /path/to/overlay.qcow2
```

This overlay records only the changes made by the new VM. It starts at a few kilobytes, making creation near-instantaneous and incredibly storage-efficient.
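If you ever want to verify the relationship, `qemu-img info` reports the overlay’s backing file and its (tiny) actual size on disk; the paths below are the same placeholders as above:

```bash
# The overlay should list base.img as its backing file and occupy only
# a few hundred KiB until the guest starts writing data.
qemu-img info /path/to/overlay.qcow2
ls -lh /path/to/overlay.qcow2
```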
## The “Identity Crisis”: Making Clones Unique
A raw clone of a disk is dangerous—it has the same SSH keys, the same static IP configs, and the same system identity as the parent. Fluid solves this using a two-step “Identity Reset” during the clone process.
- Libvirt XML Mutation: Before defining the new VM in Libvirt, we parse the base VM’s XML configuration and aggressively sanitize it (a rough CLI sketch follows this section):
- UUID Removal: We strip the old UUID so Libvirt assigns a brand new, unique identifier.
- MAC Address Regeneration: We generate a fresh, random MAC address (using the 52:54:00 prefix) to ensure the network stack sees a new device.
- Disk Swapping: We point the VM’s primary drive to our new, empty overlay file instead of the base image.
- The Cloud-Init “Amnesia” Trick: This is the most critical safety feature. Linux distributions running cloud-init will typically run setup once and then mark themselves as “done.” To force the clone to re-identify itself, we generate a custom cloud-init.iso for every single clone containing a new instance-id:
```yaml
# meta-data
instance-id: <new-vm-name>
local-hostname: <new-vm-name>
```
When the clone boots, it sees a new instance-id via the attached ISO. This signals cloud-init to run again, triggering:
- Fresh DHCP negotiation (getting a new IP for the new MAC).
- Regeneration of SSH host keys (if configured).
- User creation and SSH key injection.
This ensures that even though the disk is a clone, the OS thinks it’s booting for the first time.
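Here’s a rough, end-to-end sketch of that identity reset using stock CLI tools. The VM name, paths, and the use of `sed` and `genisoimage` are illustrative assumptions, not Fluid’s actual implementation: for brevity it simply drops the `<mac>` element and lets Libvirt generate a fresh 52:54:00 address, and it leaves out attaching the seed ISO as a CD-ROM device in the XML.

```bash
# Hypothetical names and paths; a sketch, not Fluid's real code.
NEW_NAME="sandbox-42"
OVERLAY="/var/lib/libvirt/images/${NEW_NAME}.qcow2"

# 1. Create the copy-on-write overlay on top of the golden image.
qemu-img create -f qcow2 -F qcow2 -b /path/to/base.img "$OVERLAY"

# 2. Dump the base VM's XML and sanitize it: drop the UUID and MAC
#    (Libvirt regenerates both), rename it, and point the disk at the overlay.
virsh dumpxml base-vm \
  | sed -e '/<uuid>/d' \
        -e '/<mac address=/d' \
        -e "s|<name>base-vm</name>|<name>${NEW_NAME}</name>|" \
        -e "s|/path/to/base.img|${OVERLAY}|" \
  > "/tmp/${NEW_NAME}.xml"

# 3. Build a NoCloud seed ISO carrying a fresh instance-id so cloud-init
#    re-runs on first boot.
printf 'instance-id: %s\nlocal-hostname: %s\n' "$NEW_NAME" "$NEW_NAME" > /tmp/meta-data
printf '#cloud-config\n' > /tmp/user-data
genisoimage -output "/tmp/${NEW_NAME}-seed.iso" \
  -volid cidata -joliet -rock /tmp/user-data /tmp/meta-data

# 4. Define and boot the clone (in a real setup, attach the seed ISO as a
#    CD-ROM device in the XML before this step).
virsh define "/tmp/${NEW_NAME}.xml"
virsh start "$NEW_NAME"
```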
## Why It’s Safe
- Isolation: The base image is locked. Corruption in one sandbox cannot affect others or the base.
- Network Safety: Unique MACs and forced DHCP renewal prevent IP conflicts on the bridge.
- Ephemeral Nature: Because the state lives in a disposable overlay, “wiping” a machine is as simple (and fast) as deleting a small file.
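As a concrete example of that ephemerality, tearing a sandbox down is just a stop, an undefine, and a couple of file deletions (the name and paths are the hypothetical ones from the sketch above):

```bash
# Stop the VM, drop its Libvirt definition, and delete its overlay and
# seed ISO. The read-only base image is never touched.
virsh destroy sandbox-42
virsh undefine sandbox-42
rm -f /var/lib/libvirt/images/sandbox-42.qcow2 /tmp/sandbox-42-seed.iso
```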
## Pre-flight Resource Checks
If a Libvirt host is running low on RAM or disk space, the clone operation might succeed (because the overlay is small), but the VM will fail to boot or crash later when it tries to write data.
Before creating a clone, Fluid queries the host’s stats (a rough version of the check is sketched after the list below):
- RAM Check: Use `virsh nodeinfo` to calculate available memory vs. the requested VM size.
- Disk Space Projection: While overlays start small, they can grow to the virtual size of the base image. Fluid makes sure there is a 20% buffer before cloning.
- Safety Policy: Ensure the host has enough headroom (e.g., at least a 10-20% free buffer) to accommodate the potential growth of active overlays, or implement strict disk quotas (using virtio-blk quotas) to prevent one runaway log file from filling the host disk.
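A minimal version of that pre-flight check might look like this. It uses `virsh nodememstats` (which reports free memory directly) rather than `nodeinfo`, plus `qemu-img info`, `jq`, and `df`; the requested size, thresholds, and paths are all illustrative:

```bash
# Rough pre-flight check; numbers and paths are placeholders.
REQUESTED_MB=2048
BASE_IMG=/path/to/base.img
POOL_DIR=/var/lib/libvirt/images

# RAM: refuse to clone if the host can't fit the requested guest.
FREE_KB=$(virsh nodememstats | awk '/^free/ {print $3}')
if [ $(( FREE_KB / 1024 )) -lt "$REQUESTED_MB" ]; then
  echo "refusing to clone: only $(( FREE_KB / 1024 )) MiB free" >&2
  exit 1
fi

# Disk: an overlay can grow to the base image's virtual size, so require
# that much space plus a 20% buffer in the storage pool.
VIRTUAL=$(qemu-img info --output=json "$BASE_IMG" | jq -r '."virtual-size"')
AVAIL=$(df --output=avail -B1 "$POOL_DIR" | tail -n1 | tr -d ' ')
if [ "$AVAIL" -lt $(( VIRTUAL + VIRTUAL / 5 )) ]; then
  echo "refusing to clone: not enough disk headroom in $POOL_DIR" >&2
  exit 1
fi
```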
## The “Janitor”: Sandbox Cleanup
To prevent sandboxes from idling around after they’re done being used but never got destroyed, a Janitor process checks the DEFAULT_TTL variable and removes any sandbox older than that. This is how the Fluid Remote server works.
The Fluid Terminal Agent, on the other hand, keeps track of the sandboxes it created and removes them either once they are older than 24 hours or when the terminal agent is closed, whichever happens first.
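Conceptually, the Janitor’s sweep looks something like this (the sandbox- naming convention, the use of the overlay’s modification time as a rough age, and the paths are illustrative, not the actual bookkeeping):

```bash
# Conceptual TTL sweep; naming, paths, and age tracking are illustrative.
DEFAULT_TTL=${DEFAULT_TTL:-86400}   # seconds a sandbox is allowed to live
NOW=$(date +%s)

for overlay in /var/lib/libvirt/images/sandbox-*.qcow2; do
  [ -e "$overlay" ] || continue
  name=$(basename "$overlay" .qcow2)
  age=$(( NOW - $(stat -c %Y "$overlay") ))   # rough age via overlay mtime
  if [ "$age" -gt "$DEFAULT_TTL" ]; then
    virsh destroy "$name" 2>/dev/null || true
    virsh undefine "$name"
    rm -f "$overlay"
  fi
done
```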
## What’s Next
Now, I know this setup isn’t perfect. It has its flaws: who wants to run additional VMs on their libvirt hosts? It’s not ideal, but it got me close enough to production that I didn’t mind for now. Thankfully there are a lot of different containerization/isolation technologies to choose from. If I had my way, I would build out a Firecracker implementation that brings the sandbox onto the same network level as the host, giving incredibly fast startup with a copy-on-write VM clone and the ability for sandboxes to never touch your infrastructure. Eventually I will get there, but for the MVP, this was good enough :).