# How Fluid.sh Sandboxes Work
## Intro
When you ask Fluid to spin up a new sandbox, you aren’t waiting for a full OS installation. Instead, we use a Linked Clone mechanism that provisions a fresh, isolated environment in milliseconds. Here is a deep dive into how it works, why it’s safe, and what’s next.
## The Mechanism: Linked Clones & Overlays
At the heart of Fluid’s cloning engine is the Copy-On-Write (COW) strategy.
- The Golden Image: We start with a “base” or “golden” image (e.g., a standard Ubuntu cloud image). This file remains read-only and untouched.
- The Overlay (QCOW2): When you request a new environment, we don’t copy that massive base image. Instead, we create a tiny “overlay” file using `qemu-img`:

```bash
qemu-img create -f qcow2 -F qcow2 -b /path/to/base.img /path/to/overlay.qcow2
```

This overlay records only the changes made by the new VM. It starts at a few kilobytes, making creation near-instantaneous and incredibly storage-efficient.
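If you ever want to verify the relationship, `qemu-img info` reports the overlay’s backing file and its (tiny) actual size on disk; the paths below are the same placeholders as above:

```bash
# The overlay should list base.img as its backing file and occupy only
# a few hundred KiB until the guest starts writing data.
qemu-img info /path/to/overlay.qcow2
ls -lh /path/to/overlay.qcow2
```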
## The “Identity Crisis”: Making Clones Unique
A raw clone of a disk is dangerous—it has the same SSH keys, the same static IP configs, and the same system identity as the parent. Fluid solves this using a two-step “Identity Reset” during the clone process.
- Libvirt XML Mutation: Before defining the new VM in Libvirt, we parse the base VM’s XML configuration and aggressively sanitize it (a rough CLI sketch follows this section):
- UUID Removal: We strip the old UUID so Libvirt assigns a brand new, unique identifier.
- MAC Address Regeneration: We generate a fresh, random MAC address (using the 52:54:00 prefix) to ensure the network stack sees a new device.
- Disk Swapping: We point the VM’s primary drive to our new, empty overlay file instead of the base image.
- The Cloud-Init “Amnesia” Trick: This is the most critical safety feature. Linux distributions running cloud-init will typically run setup once and then mark themselves as “done.” To force the clone to re-identify itself, we generate a custom cloud-init.iso for every single clone containing a new instance-id:
```yaml
# meta-data
instance-id: <new-vm-name>
local-hostname: <new-vm-name>
```
When the clone boots, it sees a new instance-id via the attached ISO. This signals cloud-init to run again, triggering:
- Fresh DHCP negotiation (getting a new IP for the new MAC).
- Regeneration of SSH host keys (if configured).
- User creation and SSH key injection.
This ensures that even though the disk is a clone, the OS thinks it’s booting for the first time.
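Here’s a rough, end-to-end sketch of that identity reset using stock CLI tools. The VM name, paths, and the use of `sed` and `genisoimage` are illustrative assumptions, not Fluid’s actual implementation: for brevity it simply drops the `<mac>` element and lets Libvirt generate a fresh 52:54:00 address, and it leaves out attaching the seed ISO as a CD-ROM device in the XML.

```bash
# Hypothetical names and paths; a sketch, not Fluid's real code.
NEW_NAME="sandbox-42"
OVERLAY="/var/lib/libvirt/images/${NEW_NAME}.qcow2"

# 1. Create the copy-on-write overlay on top of the golden image.
qemu-img create -f qcow2 -F qcow2 -b /path/to/base.img "$OVERLAY"

# 2. Dump the base VM's XML and sanitize it: drop the UUID and MAC
#    (Libvirt regenerates both), rename it, and point the disk at the overlay.
virsh dumpxml base-vm \
  | sed -e '/<uuid>/d' \
        -e '/<mac address=/d' \
        -e "s|<name>base-vm</name>|<name>${NEW_NAME}</name>|" \
        -e "s|/path/to/base.img|${OVERLAY}|" \
  > "/tmp/${NEW_NAME}.xml"

# 3. Build a NoCloud seed ISO carrying a fresh instance-id so cloud-init
#    re-runs on first boot.
printf 'instance-id: %s\nlocal-hostname: %s\n' "$NEW_NAME" "$NEW_NAME" > /tmp/meta-data
printf '#cloud-config\n' > /tmp/user-data
genisoimage -output "/tmp/${NEW_NAME}-seed.iso" \
  -volid cidata -joliet -rock /tmp/user-data /tmp/meta-data

# 4. Define and boot the clone (in a real setup, attach the seed ISO as a
#    CD-ROM device in the XML before this step).
virsh define "/tmp/${NEW_NAME}.xml"
virsh start "$NEW_NAME"
```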
## Why It’s Safe
- Isolation: The base image is locked. Corruption in one sandbox cannot affect others or the base.
- Network Safety: Unique MACs and forced DHCP renewal prevent IP conflicts on the bridge.
- Ephemeral Nature: Because the state lives in a disposable overlay, “wiping” a machine is as simple (and fast) as deleting a small file.
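As a concrete example of that ephemerality, tearing a sandbox down is just a stop, an undefine, and a couple of file deletions (the name and paths are the hypothetical ones from the sketch above):

```bash
# Stop the VM, drop its Libvirt definition, and delete its overlay and
# seed ISO. The read-only base image is never touched.
virsh destroy sandbox-42
virsh undefine sandbox-42
rm -f /var/lib/libvirt/images/sandbox-42.qcow2 /tmp/sandbox-42-seed.iso
```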
## Pre-flight Resource Checks
If a Libvirt host is running low on RAM or disk space, the clone operation might succeed (because the overlay is small), but the VM will fail to boot or crash later when it tries to write data.
Before creating a clone, Fluid queries the host’s stats (a rough version of the check is sketched after the list below):
- RAM Check: Use `virsh nodeinfo` to calculate available memory vs. the requested VM size.
- Disk Space Projection: While overlays start small, they can grow to the virtual size of the base image. Fluid makes sure there is a 20% buffer before cloning.
- Safety Policy: Ensure the host has enough headroom (e.g., at least a 10-20% free buffer) to accommodate the potential growth of active overlays, or implement strict disk quotas (using virtio-blk quotas) to prevent one runaway log file from filling the host disk.
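A minimal version of that pre-flight check might look like this. It uses `virsh nodememstats` (which reports free memory directly) rather than `nodeinfo`, plus `qemu-img info`, `jq`, and `df`; the requested size, thresholds, and paths are all illustrative:

```bash
# Rough pre-flight check; numbers and paths are placeholders.
REQUESTED_MB=2048
BASE_IMG=/path/to/base.img
POOL_DIR=/var/lib/libvirt/images

# RAM: refuse to clone if the host can't fit the requested guest.
FREE_KB=$(virsh nodememstats | awk '/^free/ {print $3}')
if [ $(( FREE_KB / 1024 )) -lt "$REQUESTED_MB" ]; then
  echo "refusing to clone: only $(( FREE_KB / 1024 )) MiB free" >&2
  exit 1
fi

# Disk: an overlay can grow to the base image's virtual size, so require
# that much space plus a 20% buffer in the storage pool.
VIRTUAL=$(qemu-img info --output=json "$BASE_IMG" | jq -r '."virtual-size"')
AVAIL=$(df --output=avail -B1 "$POOL_DIR" | tail -n1 | tr -d ' ')
if [ "$AVAIL" -lt $(( VIRTUAL + VIRTUAL / 5 )) ]; then
  echo "refusing to clone: not enough disk headroom in $POOL_DIR" >&2
  exit 1
fi
```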
## The “Janitor”: Sandbox Cleanup
To prevent sandboxes from idling around after they’re done being used but never got destroyed, a Janitor process checks the DEFAULT_TTL variable and removes any sandbox older than that. This is how the Fluid Remote server works.
The Fluid Terminal Agent, on the other hand, keeps track of the sandboxes it created and removes them either once they are older than 24 hours or when the terminal agent is closed, whichever happens first.
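Conceptually, the Janitor’s sweep looks something like this (the sandbox- naming convention, the use of the overlay’s modification time as a rough age, and the paths are illustrative, not the actual bookkeeping):

```bash
# Conceptual TTL sweep; naming, paths, and age tracking are illustrative.
DEFAULT_TTL=${DEFAULT_TTL:-86400}   # seconds a sandbox is allowed to live
NOW=$(date +%s)

for overlay in /var/lib/libvirt/images/sandbox-*.qcow2; do
  [ -e "$overlay" ] || continue
  name=$(basename "$overlay" .qcow2)
  age=$(( NOW - $(stat -c %Y "$overlay") ))   # rough age via overlay mtime
  if [ "$age" -gt "$DEFAULT_TTL" ]; then
    virsh destroy "$name" 2>/dev/null || true
    virsh undefine "$name"
    rm -f "$overlay"
  fi
done
```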
## What’s Next
Now, I know this setup isn’t perfect. It has its flaws: who wants to run additional VMs on their libvirt hosts? It’s not ideal, but it got me close enough to production that I didn’t mind for now. Thankfully there are a lot of different containerization/isolation technologies to choose from. If I had my way, I would build out a Firecracker implementation that brings the sandbox onto the same network level as the host, giving incredibly fast startup with a copy-on-write VM clone and the ability for sandboxes to never touch your infrastructure. Eventually I will get there, but for the MVP, this was good enough :).