Why reproducible builds?
From reproducible-builds.org:
Why does it matter?
Whilst anyone may inspect the source code of free and open source software for malicious flaws, most software is distributed pre-compiled with no method to confirm whether they correspond.
This incentivises attacks on developers who release software, not only via traditional exploitation, but also in the forms of political influence, blackmail or even threats of violence.
This is particularly a concern for developers collaborating on privacy or security software: attacking these typically result in compromising particularly politically-sensitive targets such as dissidents, journalists and whistleblowers, as well as anyone wishing to communicate securely under a repressive regime.
Whilst individual developers are a natural target, it additionally encourages attacks on build infrastructure as a successful attack would provide access to a large number of downstream computer systems. By modifying the generated binaries here instead of modifying the upstream source code, illicit changes are essentially invisible to its original authors and users alike.
How?
The Cartesi Machine, a fully deterministic emulator, is an ideal setting for reproducible builds: no non-deterministic behavior can interfere with the build process.
In short: we can already, today, build a Docker container in a deterministic manner inside a Cartesi Machine.
In addition, we can produce byte-by-byte reproducible Ubuntu 22.04 machine base images with Docker inside.
Downsides:
- All of the Docker build context needs to be inside the Cartesi Machine; no network is available, and this includes any Docker layers/images the Dockerfile uses.
- With Lambada's IPFS dehashing device, we can at least get content-addressed data into a Cartesi Machine in an easily wget'able manner.
- Many Dockerfiles fetch source files off the internet, and these may need to be changed to fetch by IPFS CIDs instead (manual work).
- Compilation will be single-threaded and not as fast as on normal build farms. Cross-compilation/user-mode emulation of the binaries doing the build may be slow. But if we publish these builds with Dave, you don't necessarily need to reproduce them locally yourself every time.
- SHA-256 hashing seems a bit slow in Docker inside a Cartesi Machine.
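To illustrate what content addressing buys us in the list above: an IPFS CID commits to the bytes themselves, so anything fetched through the dehashing device can be verified offline and a build stays deterministic. A rough sketch of computing a CIDv1 for a raw block (version 0x01, raw codec 0x55, sha2-256 multihash, base32 multibase, per the CID spec), using only the standard library. This covers only the simple raw-leaf case, not IPFS's default chunked DAG layout, so CIDs of large files added via an IPFS client will differ:

```python
import base64
import hashlib


def raw_cid_v1(data: bytes) -> str:
    """CIDv1 for a raw block: version (0x01), codec raw (0x55),
    multihash sha2-256 (code 0x12, length 0x20), base32 multibase ('b')."""
    digest = hashlib.sha256(data).digest()
    multihash = bytes([0x12, 0x20]) + digest
    cid_bytes = bytes([0x01, 0x55]) + multihash
    b32 = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + b32
```

If the bytes received do not hash back to the CID that was requested, the fetch must be rejected; that check is what lets a network-free build treat fetched inputs as part of its deterministic state.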
Future work:
- Consider adding, for example, a Nitro Enclaves HTTP proxy that acts deterministically, which we can use as part of builds and fetch through IPFS.
- Making reproducible builds more 'practical' for end users.
- Emulating other-CPU binaries via deterministic host-guest communication (instead of qemu-binfmt and user-mode emulation).
Where should it be used?
All Docker containers, kernels, ROMs, machine images, and binary distributions that deliver security-critical software to end users, nodes, or developers should be built in a reproducible manner, so that anyone (including the EVM, which with Dave can reproduce the build process) can be sure they were built from the same source code and dependencies. The same goes for any smart contracts, etc.
Being able to make reproducible builds matters not just for web3 but for the wider software industry and open source software in general. Imagine decentralised (non-GitHub) git repositories controlled by DAOs producing binary software builds.