Reproducible builds

Why reproducible builds?

From reproducible-builds.org:

Why does it matter?

Whilst anyone may inspect the source code of free and open source software for malicious flaws, most software is distributed pre-compiled with no method to confirm whether they correspond.

This incentivises attacks on developers who release software, not only via traditional exploitation, but also in the forms of political influence, blackmail or even threats of violence.

This is particularly a concern for developers collaborating on privacy or security software: attacking these typically results in compromising particularly politically-sensitive targets such as dissidents, journalists and whistleblowers, as well as anyone wishing to communicate securely under a repressive regime.

Whilst individual developers are a natural target, it additionally encourages attacks on build infrastructure as a successful attack would provide access to a large number of downstream computer systems. By modifying the generated binaries here instead of modifying the upstream source code, illicit changes are essentially invisible to its original authors and users alike.

How?

The Cartesi Machine, an emulator that behaves deterministically, is the perfect setting for reproducible builds: no non-deterministic behavior can interfere with the build process.

In short: we can already build a Docker container deterministically inside a Cartesi Machine.

In addition, we can produce byte-by-byte reproducible Ubuntu 22.04 machine base images with Docker inside.
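
As a minimal illustration of what byte-by-byte reproducibility means in practice, the sketch below (Python, with hypothetical file names) compares the SHA256 digests of the same image produced by two independent build runs; a reproducible build yields identical digests:

    import hashlib
    import sys

    def sha256_of(path: str) -> str:
        """Stream a file through SHA256 and return its hex digest."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Hypothetical outputs of two independent builds of the same source.
    a = sha256_of("build-run-1/rootfs.ext2")
    b = sha256_of("build-run-2/rootfs.ext2")
    print(a, b, sep="\n")
    sys.exit(0 if a == b else 1)  # identical digests: build is reproducible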

Downsides:

  • The entire Docker build context needs to be inside the Cartesi Machine, since no network is used; this includes any Docker layers/images the Dockerfile depends on

  • With Lambada’s IPFS dehashing device, we can at least get content-addressed data into a Cartesi Machine in an easily wget’able manner (see the fetch sketch after this list)

  • Many Dockerfiles fetch source files off the internet, and they may need changes to fetch these via IPFS CIDs instead (manual work)

  • Compilation will be single-threaded and not as fast as on normal build farms. Cross-compilation/user-mode emulation of the binaries doing the build may be slow. But if we publish these builds with Dave, you don’t necessarily need to reproduce them locally yourself every time.

  • SHA256 hashing seems a bit slow in Docker inside the Cartesi Machine
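
To make the dehashing-device point concrete, here is a rough sketch of fetching content-addressed data and pinning its digest; the gateway URL, CID and output path are placeholder assumptions, not Lambada’s actual interface:

    import hashlib
    import urllib.request

    def fetch_by_cid(cid: str, expected_sha256: str, out_path: str) -> None:
        """Fetch content-addressed data and refuse anything unexpected."""
        # Placeholder endpoint; the real dehashing-device interface may differ.
        gateway = "http://127.0.0.1:8080/ipfs/"
        data = urllib.request.urlopen(gateway + cid).read()
        digest = hashlib.sha256(data).hexdigest()
        if digest != expected_sha256:
            raise ValueError(f"content mismatch: got {digest}")
        with open(out_path, "wb") as f:
            f.write(data)

    # Usage (CID and digest are placeholders):
    # fetch_by_cid("bafy...", "<pinned sha256>", "source.tar.gz")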

Future work:

  • We might want to consider adding, for example, a Nitro Enclaves HTTP proxy that acts in a deterministic manner, which we could use as part of builds and fetch through IPFS

  • Making reproducible builds more ‘practical’ for end users

  • Emulating binaries for other CPU architectures via deterministic host-guest communication (instead of qemu-binfmt user-mode emulation)

Where should it be used?

All Docker containers, kernels, ROMs, machine images, and binary distributions that deliver security-critical software to end users, nodes, or developers should be built in a reproducible manner, so that anyone (including the EVM, which with Dave can reproduce the build process) can be sure they’re built from the same source code and dependencies. The same goes for any smart contracts, etc.

Being able to make reproducible builds matters not just for web3 but for the wider software industry and open source software in general. Imagine decentralised git (non-GitHub) repositories controlled by DAOs producing binary software builds.


I like this initiative; it touches a big industry-wide problem of having reliable source code. At the same time, I feel the co-processor initiative would be more relevant to the Cartesi ecosystem from a priority point of view. Thoughts?

Even without considering the wider non-web3 applications, I believe this proposal is a must-have for any serious application using Cartesi Machines. I fully support it.

In short: without reproducible builds, there is no way for a user to inspect an application’s source code to decide if he/she should trust it.

Practically speaking, in Rollups and Compute an application is defined by the template hash of a Cartesi Machine. So, when a user interacts with such an application, he/she has a guarantee that any validator securing the application will execute a Cartesi Machine whose snapshot exactly matches that hash.
From there, if, and only if, the application build is reproducible, the user can download the source code that allegedly corresponds to the application, build the machine locally, and check that the machine he built himself matches the one that the validators are executing. Then, and only then, can he be sure that the source code he’s reading, analyzing and testing corresponds to the real thing.
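
A sketch of that verification step, assuming (hypothetically) that the locally built snapshot directory exposes its 32-byte root hash in a file named hash, and that the on-chain template hash has already been read from the contract:

    ONCHAIN_TEMPLATE_HASH = "0x" + "00" * 32  # placeholder: read from the contract

    def local_template_hash(snapshot_dir: str) -> str:
        # Assumed snapshot layout: a raw 32-byte root-hash file.
        with open(f"{snapshot_dir}/hash", "rb") as f:
            return "0x" + f.read().hex()

    def machine_matches_chain(snapshot_dir: str) -> bool:
        # If and only if the build is reproducible will this hold for
        # anyone who rebuilds the machine from the published source code.
        return local_template_hash(snapshot_dir) == ONCHAIN_TEMPLATE_HASH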

The current status is that Cartesi DApps may be reproducible, but it’s up to the developer to ensure that - and it is definitely not an easy task, especially for applications whose build process needs to compile code.

Important note: in Ethereum, this reproducibility is ensured by the Solidity compiler. That is how Etherscan is able to verify contracts.


I didn’t understand the proposal well. I get that the CM can help with the build itself, thanks to its reproducibility property. But for dependencies, addressing them via IPFS with CIDs will be very hard to manage (if that is the proposal). Dependencies are usually addressed by name + version, and usually have an associated hash so you are sure you are getting the same one you got before. I know most users don’t configure their dependencies like that, for different reasons, but the manual work required to do it right using name + version + hash is the same as using a CID, and the former has the advantage of being clearer and more readable.
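
To illustrate the comparison, here is a hypothetical lockfile entry pinning the same artifact both ways; either pin identifies exactly one sequence of bytes, and the verification work is identical (all values below are made up for illustration):

    import hashlib

    # Hypothetical lockfile entry: one artifact pinned by name + version +
    # sha256 and, equivalently, by an IPFS CID of the same bytes.
    DEPENDENCY = {
        "name": "somelib",
        "version": "1.2.3",
        "sha256": "<pinned sha256 of somelib-1.2.3.tar.gz>",
        "cid": "<CID of the same tarball>",
    }

    def verify(path: str) -> bool:
        """Check a downloaded artifact against the pinned digest."""
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest() == DEPENDENCY["sha256"]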

For dependencies, maybe the idea would be to fetch a known mirror archive, make it available inside the Cartesi Machine, and then install from there? @carsten.munk would you like to comment?

But I mean dependencies as a loose concept: they can be Debian packages, npm, cargo, pip, maven, zips, tars, and the list goes on. You’d need an archive for all of that.

One additional comment: from a developer experience perspective, I would love this proposal to result in an easy-to-use solution for devs.

Maybe something like a sunodo build --reproducible command that would build the DApp inside the CM to ensure the result is reproducible - it would be used only when deploying to Mainnet, so I guess it would be acceptable if it took hours or even days to finish.

For dependencies, maybe the idea would be to fetch a known mirror archive, make it available inside the Cartesi Machine, and then install from there? @carsten.munk would you like to comment?

I’m intentionally being a bit vague about the ‘how’ here because it really depends on what’s being built. As a roadmap item, I think the availability of bare-minimum reproducible builds of our software belongs here - and ideally of dapps as well.

In general, a deterministic build process is possible if the entire build context is available. How it becomes available is a backend/UX question. I’ve proposed IPFS mirrors of artifacts, HTTP oracles, or other ecosystem components that can be accessed separately.

I don’t understand how this would be achieved or how hard a challenge it is.

Just wanted to second Milton’s words here, I believe this is a fundamental piece of Cartesi. Not having it messes up a bunch of the trust assumptions.

This is super relevant and pretty hard to achieve. We currently have an initiative on this front funded by the CGP (Cartenix). The Cartenix team is working on building reproducible software inside the Cartesi Machine by leveraging the Nix build system. If we build a mirror populated by packages built using Cartenix, we would ensure a very high level of security for the packages. The mirror would first be populated with packages that have no dependencies and build its way up. Every package would be a different computation performed inside a Cartesi Machine, and packages higher in the hierarchy would use previously Cartenix-built packages for their dependencies (a sketch of this bottom-up order follows).
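
A toy sketch of that bottom-up mirror population (the package names and the dependency graph are made up):

    from graphlib import TopologicalSorter

    # Made-up dependency graph: each key lists the packages it depends on.
    deps = {
        "musl": [],
        "zlib": ["musl"],
        "openssl": ["musl", "zlib"],
        "curl": ["zlib", "openssl"],
    }

    # Dependency-free packages come first; each later package is a separate
    # computation inside a Cartesi Machine, consuming previously built
    # packages from the mirror.
    for pkg in TopologicalSorter(deps).static_order():
        print(f"build {pkg} inside a Cartesi Machine, publish to the mirror")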
