A list of fun things we've done for CI runners to improve CI:
- Configured a block-level in-memory disk accelerator / cache (fs operations at the speed of RAM!)
- Benchmarked EC2 instance types (m7a is the best x86 today, m8g is the best arm64)
- "Warming" the root EBS volume by accessing a set of priority blocks before the job starts to give the job full disk performance [0]
- Launching each runner instance in a public subnet with a public IP - the runner gets full throughput from AWS to the public internet, and IP-based rate limits rarely apply (Docker Hub)
- Configuring Docker with containerd/estargz support
- Just generally turning kernel options and unit files off that aren't needed
[0] https://docs.aws.amazon.com/ebs/latest/userguide/ebs-initial...
> Launching each runner instance in a public subnet with a public IP - the runner gets full throughput from AWS to the public internet, and IP-based rate limits rarely apply (Docker Hub)
Are you not using a caching registry mirror, instead pulling the same image from Hub for each runner...? If so that seems like it would be an easy win to add, unless you specifically do mostly hot/unique pulls.
The more efficient answer to those rate limits is almost always to pull fewer times for the same work, rather than scaling in a way that circumvents them.
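For reference, pointing the daemon at a pull-through cache is a one-line setting (the mirror URL below is a placeholder for whatever cache you actually run):

```sh
# /etc/docker/daemon.json - route Docker Hub pulls through a caching mirror
cat >/etc/docker/daemon.json <<'EOF'
{
  "registry-mirrors": ["https://hub-mirror.internal.example"]
}
EOF
systemctl restart docker
```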
Today we (Depot) are not, though some of our customers configure this. For the moment at least, the ephemeral public IP architecture makes it generally unnecessary from a rate-limit perspective.
From a performance / efficiency perspective, we generally recommend using ECR Public images[0], since AWS hosts mirrors of all the "Docker official" images, and throughput to ECR Public is great from inside AWS.
[0] https://gallery.ecr.aws/
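Concretely, since the Docker official images are mirrored under the public.ecr.aws/docker/library namespace, the change is usually just a different image reference (tag chosen here only as an example):

```sh
# Instead of pulling library/node from Docker Hub:
docker pull public.ecr.aws/docker/library/node:22-alpine
```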
If you’re running inside AWS us-east-1 then docker hub will give you direct S3 URLs for layer downloads (or it used to anyway)
Any pulls doing this become zero cost for docker hub
Any sort of cache you put between docker hub and your own infra would probably be S3 backed anyway, so adding another cache in between could be mostly a waste
Yeah we do some similar tricks with our registry[0]: pushes and pulls from inside AWS are served directly from AWS for maximum performance and no data transfer cost. Then when the client is outside AWS, we redirect all that to Tigris[1], also for maximum performance (CDN) and minimum data transfer cost (no cost from Tigris, just the cost to move content out of AWS once).
[0]: https://depot.dev/blog/introducing-depot-registry
[1]: https://www.tigrisdata.com/blog/depot-registry/
> Configured a block-level in-memory disk accelerator / cache (fs operations at the speed of RAM!)
I'm slightly old; is that the same thing as a ramdisk? https://en.wikipedia.org/wiki/RAM_drive
Exactly, a ramdisk-backed writeback cache for the root volume on Linux. For macOS we wrote a custom nbd filter to achieve the same thing.
Forgive me, I'm not trying to be argumentative, but doesn't Linux (and presumably all modern OSes) already have a ram-backed writeback cache for filesystems? That sounds exactly like the page cache.
No worries, entirely valid question. There may be ways to tune page cache to be more like this, but my mental model for what we've done is effectively make reads and writes transparently redirect to the equivalent of a tmpfs, up to a certain size. If you reserve 2GB of memory for the cache, and the CI job's read and written files are less than 2GB, then _everything_ stays in RAM, at RAM throughput/IOPS. When you exceed the limit of the cache, blocks are moved to the physical disk in the background. Feels like we have more direct control here than page cache (and the page cache is still helping out in this scenario too, so it's more that we're using both).
> reads and writes transparently redirect to the equivalent of a tmpfs, up to a certain size
The last bit (emphasis added) sounds novel to me; I don't think I've heard of anybody doing that before. It sounds like an almost-"free" way to get a ton of performance ("almost" because somebody has to figure out the sizing. Though, I bet you could automate that by having your tool export a "desired size" metric that's equal to the high watermark of tmpfs-like storage used during the CI run).
Just to add, my understanding is that unless you also tune your workload writes, the page cache will not skip backing storage for writes, only for reads. So it does make sense to stack both if you're fine with not being able to rely on persistence of those writes.
No, it's more like swapping pages to disk when RAM is full, or like using RAM when the L2 cache is full.
Linux page cache exists to speed up access to the durable store which is the underlying block device (NVMe, SSD, HDD, etc).
The RAM-backed block device in question here is more like tmpfs, but with an ability to use the disk if, and only if, it overflows. There's no intention or need to store its whole contents on the durable "disk" device.
Hence you can do things entirely in RAM as long as your CI/CD job can fit all the data there, but if it can't fit, the job just gets slower instead of failing.
If you clearly understand your access patterns and memory requirements, you can often outperform the default OS page cache.
Consider a scenario where your VM has 4GB of RAM but your build touches 16GB worth of files over its lifetime, while its active working set at any moment is only around 2GB. If you preload all Docker images at the start of your build, they'll initially be cached in RAM. However, as your build progresses, the kernel will begin evicting those cached images to accommodate recently accessed data, even files used infrequently or just once. And that's the key bit: being able to force caching of the files you know are accessed more than once.
By implementing your own caching layer, you gain explicit control, allowing critical data to remain persistently cached in memory. In contrast, the kernel-managed page cache treats cached pages as opportunistic, evicting the least recently used pages whenever new data must be accommodated, even if this new data isn't frequently accessed.
> If you clearly understand your access patterns and memory requirements, you can often outperform the default OS page cache.
I believe various RDBMSs bypass the page cache and use their own strategies for managing caching if you give them access to raw block devices, right?
That is true and correct, except that Linux does not have raw devices, and O_DIRECT on a file is not a complete replacement for the raw devices (the buffer cache still gets involved as well as the file system).
> - Configured a block-level in-memory disk accelerator / cache (fs operations at the speed of RAM!)
Every Linux kernel does that already. I currently have 20 GB of disk cached in RAM on this laptop.
The ramdisk that overflows to a real disk is a cool concept that I didn't previously consider. Is this just clever use of bcache? If you have any docs about how this was set up I'd love to read them.
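For illustration only, something in this family can be assembled from stock pieces with a brd ramdisk acting as a bcache writeback cache. This is an assumption about the general technique, not a description of Depot's actual setup, and the device names are placeholders:

```sh
# 2 GiB ramdisk (/dev/ram0); rd_size is in KiB
modprobe brd rd_nr=1 rd_size=2097152

# Format the NVMe device as backing store and the ramdisk as cache
# (must be done before the backing device holds your live filesystem;
# udev normally registers both and creates /dev/bcache0)
make-bcache -C /dev/ram0 -B /dev/nvme1n1

echo writeback > /sys/block/bcache0/bcache/cache_mode   # cache writes, not just reads
echo 0 > /sys/block/bcache0/bcache/sequential_cutoff    # don't bypass sequential IO

mkdir -p /workspace
mkfs.ext4 /dev/bcache0
mount /dev/bcache0 /workspace
# Anything that fits in the 2 GiB cache is served at RAM speed; overflow is
# written back to the NVMe device in the background. Cache contents vanish on
# power-off, which is fine for ephemeral CI runners.
```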
have you tried Buildkite? https://buildkite.com
`apt` installation could be easily sped-up with `eatmydata`: `dpkg` calls `fsync()` on all the unpacked files, which is very slow on HDDs, and `eatmydata` hacks it out.
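For anyone who hasn't used it: eatmydata is an LD_PRELOAD shim that turns fsync(), fdatasync(), sync() and msync() into no-ops for the wrapped command, so usage is just:

```sh
apt-get update && apt-get install -y eatmydata
# Any command run under eatmydata sees the no-op sync calls
eatmydata apt-get install -y build-essential postgresql-client
```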
Really, it would help if you could just disable fsync at the OS level. A bunch of other common package managers and tools call it too; Docker is a big culprit.
If you corrupt a CI node, whatever. Just rerun the step
CI containers should probably run entirely from tmpfs.
Can tmpfs be backed by persistent storage? Most of the recent stuff I've worked on is a little too big to fit in memory handily. Ideally about 20GiB of scratch space for 4-8GiB of working memory would be ideal.
I've had good success with machines that have NVMe storage (especially on cloud providers) but you still are paying the cost of fsync there even if it's a lot faster
tmpfs is backed by swap space, in the sense that it will overflow to use swap capacity but will not become persistent (since the lack of persistence is a feature).
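A rough sketch of that trade-off, assuming local NVMe is available for swap (sizes and paths are placeholders; nothing here survives a reboot):

```sh
# Swap on the local NVMe gives the tmpfs somewhere to spill to
fallocate -l 16G /mnt/nvme/swapfile
chmod 600 /mnt/nvme/swapfile
mkswap /mnt/nvme/swapfile
swapon /mnt/nvme/swapfile

# Oversized tmpfs: ~20 GiB of scratch on a 4-8 GiB machine; cold pages get
# pushed out to swap instead of failing the job
mkdir -p /scratch
mount -t tmpfs -o size=20G tmpfs /scratch
```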
We built EtchaOS for this use case--small, immutable, in memory variants of Fedora, Debian, Ubuntu, etc bundled with Docker. It makes a great CI runner for GitHub Actions, and plays nicely with caching:
https://etcha.dev/etchaos/
We're having some success with doing this at the block level (e.g. in-memory writeback cache).
Why do it at the block level (instead of tmpfs)? Or do you mean that you're doing actual real persistent disks that just have a lot of cache sitting in front of them?
The block level has two advantages: (1) you can accelerate access to everything on the whole disk (like even OS packages) and (2) everything appears as one device to the OS, meaning that build tools that want to do things like hardlink files in global caches still work without any issue.
You can probably use a BPF return override on fsync and fdatasync and sync_file_range, considering that the main use case of that feature is syscall-level error injection.
edit: Or, even easier, just use the pre-built fail_function infrastructure (with retval = 0 instead of an error): https://docs.kernel.org/fault-injection/fault-injection.html
I'd love to experiment with that and/or flags like `noatime`, especially when CI nodes are single-use and ephemeral.
noatime is irrelevant because everyone has been using relatime for ages, and updating the atime field with relatime means you're writing that block to disk anyway, since you're updating the mtime field. So no I/O saved.
atime is so exotic that disabling it shouldn't even need to be considered experimental. I consider it legacy at this point.
> Docker is a big culprit
Actually in my experience with pulling very large images to run with docker it turns out that Docker doesn't really do any fsync-ing itself. The sync happens when it creates an overlayfs mount while creating a container because the overlayfs driver in the kernel does it.
A volatile flag to the kernel driver was added a while back, but I don't think Docker uses it yet https://www.redhat.com/en/blog/container-volatile-overlay-mo...
Well yeah, but indirectly through the usage of Docker, I mean.
Unpacking the Docker image tarballs can be a bit expensive--especially with things like nodejs where you have tons of tiny files
Tearing down overlayfs is a huge issue, though
This is a neat idea that we should try. We've tried the `eatmydata` thing to speed up dpkg, but the slow part wasn't the fsync portion but rather the dpkg database.
No need for eatmydata, dpkg has an unsafe-io option.
Other options are to use an overlay mount with volatile or ext4 with nobarrier and writeback.
unsafe-io eliminates most fsyncs, but not all of them.
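For reference, a minimal sketch of those knobs (file and directory names are placeholders; the official Debian/Ubuntu container images ship a similar force-unsafe-io snippet):

```sh
# Persistent dpkg setting
echo force-unsafe-io > /etc/dpkg/dpkg.cfg.d/99-unsafe-io

# Or per invocation, via apt
apt-get -o Dpkg::Options::=--force-unsafe-io install -y curl

# Overlay mount that skips syncs entirely (kernel 5.10+)
mount -t overlay overlay \
  -o lowerdir=/lower,upperdir=/upper,workdir=/work,volatile /merged
```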
"`dpkg` calls `fsync()` on all the unpacked files"
Why in the world does it do that ????
Ok I googled (kagi). Same reason anyone ever does: pure voodoo.
If you can't trust the kernel to close() then you can't trust it to fsync() or anything else either.
Kernel-level crashes, the only kind of crash that risks half-written files, are no more likely during dpkg than any other time. A bad update is the same bad update regardless, no better, no worse.
Let's start from the assumption that dpkg shouldn't commit to its database that package X is installed/updated until all the on-disk files reflect that. Then if the operation fails and you try again later (on next boot or whatever) it knows to check that package's state and try rolling forward.
> If you can't trust the kernel to close() then you can't trust it to fsync() or anything else either.
https://man7.org/linux/man-pages/man2/close.2.html

> A successful close does not guarantee that the data has been successfully saved to disk, as the kernel uses the buffer cache to defer writes. Typically, filesystems do not flush buffers when a file is closed. If you need to be sure that the data is physically stored on the underlying disk, use fsync(2). (It will depend on the disk hardware at this point.)

So if you want to wait until it's been saved to disk, you have to do an fsync first. If you even just want to know if it succeeded or failed, you have to do an fsync first.

Of course none of this matters much on an ephemeral GitHub Actions VM. There's no "on next boot or whatever". So this is one environment where it makes sense to bypass all this careful durability work that I'd normally be totally behind. It seems reasonable enough to say it's reached the page cache, it should continue being visible in the current boot, and tomorrow will never come.
Writing to disk is none of your business unless you are the kernel itself.
Huh? Are you thinking dpkg is sending NVMe commands itself or something? No, that's not what that manpage means. dpkg is asking the kernel to write stuff to the page cache, and then asking for a guarantee that the data will continue to exist on next boot. The second half is what fsync does. Without fsync returning success, this is not guaranteed at all. And if you don't understand the point of the guarantee after reading my previous comment...and this is your way of saying so...further conversation will be pointless...
"durability" isn't voodoo. Consider if dpkg updates libc.so and then you yank the power cord before the page cache is flushed to disk, or you're on a laptop and the battery dies.
> before the page cache is flushed to disk
And if you yank the cord before the package is fully unpacked? Wouldn't that just be the same problem? Solving that problem involves simply unpacking to a temporary location first, verifying all the files were extracted correctly, and then renaming them into existence. Which actually solves both problems.
Package management is stuck in a 1990s idea of "efficiency" which is entirely unwarranted. I have more than enough hard drive space to install the distribution several times over. Stop trying to be clever.
> Wouldn't that just be the same problem?
Not the same problem, it's half-written file vs half of the files in older version.
> Which actually solves both problems.
It does not, and you would have to guarantee that multiple rename operations are executed in a transaction. Which you can't, unless you have a really fancy filesystem.
> Stop trying to be clever.
It's called being correct and reliable.
> have to guarantee that multiple rename operations are executed in a transaction. Which you can't. Unless you have really fancy filesystem
Not strictly. You have to guarantee that after reboot you rollback any partial package operations. This is what a filesystem journal does anyways. So it would be one fsync() per package and not one per every file in the package. The failure mode implies a reboot must occur.
> It's called being correct and reliable.
There are multiple ways to achieve this. There are different requirements among different systems which is the whole point of this post. And your version of "correct and reliable" depends on /when/ I pull the plug. So you're paying a huge price to shift the problem from one side of the line to the other in what is not clearly a useful or pragmatic way.
> You have to guarantee that after reboot you rollback any partial package operations.
In both scenarios, yes. This is what the dpkg database is for: it keeps info about the state of each package (whether it is installed, unpacked, configured and so on). It is required to handle the interrupted-update scenario, no matter whether the update was interrupted during package unpacking or in the configuration stage.
So far you are just describing --force-unsafe-io from dpkg. It is called unsafe because you can end up with zeroed or 0-length files long after the package has been marked as installed.
> This is what a filesystem journal does anyways.
This is incorrect. And also filesystem journal is irrelevant.
Filesystem journal protects you from interrupted writes on the disk layer. You set some flag, you write to some temporary space called journal, you set another flag, then you copy that data to your primary space, then you remove both flags. If something happens during that process you'll know and you'll be able to recover because you know in which step you were interrupted.
Without filesystem journal every power outage could result in not being able to mount the filesystem. Journal prevents that scenario. This has nothing to do with package managers, page cache or fsync.
Under Linux you do the whole write() + fsync() + rename() dance, for every file, because this is the only way you don't end up in the scenario where you've written the new file, renamed it, marked the package as installed and fsynced the package manager database, but the actual new file contents never left the page cache and now you have a bunch of zeroes on the disk. You have to fsync(). These are the semantics of the layer you are working with. No fsync(), no guarantee that data is on the disk. Even if you wrote it and closed the file hours ago. And fsynced the package manager database.
> There are different requirements among different systems which is the whole point of this post.
Sorry, I was under the assumption that this thread is about dpkg and fsync and your idea of "solving the problem". I just wanted to point out that, no, package managers are not "trying to be clever" and are not "stuck in the 1990s". You can't throw fsync() out of the equation, reorder a bunch of steps and call this "a solution".
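For concreteness, the per-file dance described above looks roughly like this (illustrative shell only; dpkg does this in C, the helper name and paths are made up, and `sync FILE` / `sync -d` need a reasonably recent coreutils):

```sh
atomic_install() {  # hypothetical helper sketching the write+fsync+rename dance
  src=$1 dest=$2
  tmp=$(mktemp "$(dirname "$dest")/.new.XXXXXX")
  cp "$src" "$tmp"
  sync -d "$tmp"              # fdatasync: the new contents are on disk...
  mv "$tmp" "$dest"           # ...before the atomic rename makes them visible
  sync "$(dirname "$dest")"   # fsync the directory so the rename itself is durable
}

atomic_install ./unpacked/usr/lib/libfoo.so.1 /usr/lib/libfoo.so.1
```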
Like I said.
Pretty sure the kernel doesn't have to fsync on close. In fact, you don't want it to; otherwise you're limiting the performance of your page cache. So fsync on install for dpkg makes perfect sense.
I didn't say it synced. The file is simply "written" and available at that point.
It makes no sense to trust that fsync() does what it promises but not that close() does what it promises. close() promises that when close() returns, the data is stored and some other process may open() and find all of it verbatim. And that's all you care about or have any business caring about unless you are the kernel yourself.
I would like to introduce you to a few case studies on bugzilla:
https://wiki.debian.org/Teams/Dpkg/FAQ#Q:_Why_is_dpkg_so_slo...
Get involved, tell and show them you know better. They have a bug tracker, mailing list etc
> Kernel-level crashes, the only kind of crash that risks half-written files [...]
You can get half-written files in many other circumstances, eg on power outages, storage failures, hw caused crashes, dirty shutdowns, and filesystem corruption/bugs.
(Nitpick: trusting the kernel to close() doesn't have anything to do with this, like a sibling comment says)
A power outage or other hardware fault or kernel crash can happen at any unpredicted time, equally just before or just after any particular action, including an fsync().
Trusting close() does not mean that the data is written all the way to disk. You don't care if or when it's all the way written to disk during dpkg ops any more than at any of the other infinite seconds of run time that aren't a dpkg op.
close() just means that any other thing that expects to use that data may do so. And you can absolutely bank on that. And if you think you can't bank on that, that means you don't trust the kernel to be keeping track of file handles, and if you don't trust the kernel to close(), then why do you trust it to fsync()?
A power rug pull does not worsen this. That can happen at any time, and there is nothing special about dpkg.
What's your experience developing a package manager for one of the world's most popular Linux distributions?
Maybe they know something you don't ?????
This is actually an interesting topic, and it turns out the kernel never made any promises about close() and data being on the disk :)
And about kernel-level crashes: yes, but you see, dpkg creates a new file on the disk, makes sure it is written correctly with fsync() and then calls rename() (or something like that) to atomically replace the old file with the new one.
So there is never a possibility of a given file being corrupted during an update.
> Kernel-level crashes, the only kind of crash that risks half-written files, are no more likely during dpkg than any other time. A bad update is the same bad update regardless, no better, no worse.
Imagine this scenario; you're writing a CI pipeline:
1. You write some script to `apt-get install` blah blah
2. As soon as the script is done, your CI job finishes.
3. Your job is finished, so the VM is powered off.
4. The hypervisor hits the power button but, oops, the VM still had dirty disk cache/pending writes.
The hypervisor may immediately pull the power (chaos monkey style; developers don't have patience), in which case those writes are now corrupted. Or, it may use ACPI shutdown which then should also have an ultimate timeout before pulling power (otherwise stalled IO might prevent resources from ever being cleaned up).
If you rely on the sync happening at step 4, as part of the kernel's graceful shutdown, how long does the kernel wait before it decides that a shutdown timeout has occurred? How long does the hypervisor wait, and is it longer than the kernel would wait? Are you even sure that the VM shutdown command you're sending is the graceful one?
How would you fsync at step 3?
For step 2, perhaps you might have an exit script that calls `fsync`.
For step 1, perhaps you might call `fsync` after `apt-get install` is done.
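One pragmatic mitigation for steps 2-3, if you do care about the written data: make the last thing the job runs an explicit flush, before the hypervisor gets a chance to pull power (path shown is a placeholder):

```sh
# Flush everything still sitting in the page cache
sync
# Or scope it to one filesystem with syncfs()
sync -f /var/lib/docker
```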
This is like people who think they have designed their own even better encryption algorithm. Voodoo. You are not solving a problem better than the kernel and filesystem (and hypervisor in this case) have already done. If you are not writing the kernel or a driver or bootloader itself, then fsync() is not your problem or responsibility and you aren't filling any holes left by anyone else. You're just rubbing the belly of the Buddha statue for good luck to feel better.
You didn't answer any of the questions proposed by the outlined scenario.
Until you answer how you've solved the "I want to make sure my data is written to disk before the hypervisor powers off the virtual machine at the end of the successful run" problem, then I claim that this is absolutely not voodoo.
I suggest you actually read the documentation of all of these things before you start claiming that `fsync()` is exclusively the purpose of kernel, driver, or bootloader developers.
So... the goal is to make `apt` to be web scale?
(to be clear: my comment is sarcasm and web scale is a reference to a joke about reliability [0])
[0]: https://www.youtube.com/watch?v=b2F-DItXtZs
This is exactly the kind of content marketing I want to see. The IO bottleneck data and the fio scripts are useful to all. Then at the end a link to their product which I’d never heard of, in case you’re dealing with the issue at hand.
Thank you for the kind words. We’re always trying to share our knowledge even if Depot isn’t a good fit for everyone. I hope the scripts get some mileage!
TLDR: disk is often the bottleneck in builds. Use fio to measure the disk's actual performance.
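If you just want a quick number, something like this gives you 4k random IOPS and sequential throughput (flags are illustrative; these are not the article's exact fio scripts):

```sh
# 4k random read/write IOPS
fio --name=iops --filename=/tmp/fio.bin --size=2G --direct=1 \
    --ioengine=libaio --rw=randrw --rwmixread=70 --bs=4k --iodepth=64 \
    --runtime=30 --time_based --group_reporting

# Sequential write throughput
fio --name=throughput --filename=/tmp/fio.bin --size=2G --direct=1 \
    --ioengine=libaio --rw=write --bs=1M --iodepth=16 \
    --runtime=30 --time_based --group_reporting

rm -f /tmp/fio.bin
```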
If you want to truly speed up builds by optimizing disk performance, there are no shortcuts to physically attaching NVMe storage with high throughput and high IOPS to your compute directly.
That's what we do at WarpBuild[0] and we outperform Depot runners handily. This is because we do not use network attached disks which come with relatively higher latency. Our runners are also coupled with faster processors.
I love the Depot content team though, it does a lot of heavy lifting.
[0] https://www.warpbuild.com
I'm maintaining a benchmark of various GitHub Actions providers regarding I/O speed [1]. Depot is not present because my account was blocked but would love to compare! The disk accelerator looks like a nice feature.
[1]: https://runs-on.com/benchmarks/github-actions-disk-performan...
If you can afford it, upgrade your CI runners on GitHub to the paid offering. Highly recommend: less drinking coffee, more instant unit test results. Pay as you go.
As a Depot customer, I'd say if you can afford to pay for GitHub's runners, you should pay for Depot's instead. They boot faster, run faster, are a fraction of the price. And they are lovely people who provide amazing support.
[flagged]
I think you're being downvoted because this comes off a bit... I dunno... petty I guess?
Maybe reply to the original post and let your product stand on it's own merits.
This is what we focus on with Depot. Faster builds across the board without breaking the bank. More time to get things done and maybe go outside earlier.
Trading Strategy looks super cool, by the way.
I just migrated multiple ARM64 GitHub Actions Docker builds from my self-hosted runner (Raspberry Pi in my homelab) to Blacksmith.io and I'm really impressed with the performance so far. Only downside is no Docker layer and image cache like I had on my self-hosted runner, but can't complain on the free tier.
Have you checked out https://docs.blacksmith.sh/docker-builds/incremental-docker-...? This should help setup a shared, persistent docker layer cache for your runners
Thanks for sharing. I have a custom bash script which does the docker builds currently and swapping to useblacksmith/build-push-action would take a bit of refactoring which I don't want to spend the time on now. :-)
So I had to read to the end to realize it's kind of an infomercial. Ok, fair enough. Didn't know what Depot was, though.
Bummer there's no free tier. I've been bashing my head against an intermittent CI failure problem on Github runners for probably a couple years now. I think it's related to the networking stack in their runner image and the fact that I'm using docker in docker to unit test a docker firewall. While I do appreciate that someone at Github did actually look at my issue, they totally missed the point. https://github.com/actions/runner-images/issues/11786
Are there any reasonable alternatives for a really tiny FOSS project?
We are working on a platform that lets you measure this stuff in real-time for free.
you can check us out at https://yeet.cx
We also have an anonymous guest sandbox you can play with:
https://yeet.cx/play