Nix also needs the build output to be deterministic to calculate the hash. It also has the problems of timestamps etc. The build environment tries to be hermetic by setting the time to be epoch among other things.
These seem very reasonable, the workarounds used are natural, and overall the article is not at all congruous with the conclusion in the (clickbait?) title?
Reading this, I think low level engineering is actually more dependent on specific environments. Hardware also has its own points of change. Usually, when you think at a high level, environmental changes are less significant than you might expect. But low level thinking tends to be tied to specific environments, which is what makes it difficult. The reason low level is hard is that even if the code itself is short, the hidden assumptions inside it are difficult and place a heavy cognitive load on the programmer. For example, even a short snippet in C like
`int value = (int)buffer`
requires a lot of implicit knowledge about the 4 byte alignment of the buffer, or whether int is exactly 32 bits. LLMs do not seem to be very good at knowing these things. Rather, they are strong at high level wrapping, but at the low level, they seem surprisingly difficult and somewhat useless. Hardware has CPU generation changes, and in the case of PLCs, where I mainly work, the protocol differences between vendors are far too severe. There does not seem to be any technology with a very long lifecycle.
Depends on what you mean by low level I guess. Compared to web application framework churn rate, simple procedural programming without many dependencies is remarkably stable. You tend to program in a way that works for most platforms (all targetted platforms). How to best do that you learn over the years. To me personally it's very refreshing if the environment around you does not constantly change. That affords learning a bag of tricks and a list of gotchas to avoid.
Time and date are... tolerable. There's SOURCE_DATE_EPOCH which should always be set to whack it into submission when used. ASLR of the _compiler being invoked_ resulting in a difference in the _program being compiled_ is nuts and would break any self-hosting compiler with consistency checks.
To avoid all those grotesque and absurd compilers and runtimes, more for those of computer languages with a ultra-complex syntax (c++ and similar), I now design "binary specifications" which I "design" and "validate" with RISC-V assembly coding.
Here, since any whatwg cartel web engine is an issue, the author should not bother.
If Clang generated non-deterministic output due to pointer addresses then that's a bug (happens regularly) that should be fixed. The most common way this happens if it some code path is iterating over a DenseMap which is non-deterministic. Sometimes that's fine and sometimes that's not depending on how that map is used. The common way to fix that is to switch to a MapVector which pays some additional runtime/memory cost to guarantee deterministic iteration order.
I'll try and make a minimal reproduction case and file a bug. Do you know if any tooling that can take a binary and fuzz it down to a minimal reproduction set?
> lol you'd think, but no, it's not. In theory it is (and for small scale compilers it definitely is), but in practice compilers are strange and complicated beasts containing multitudes that no mere mortal can fully comprehend on their own.
Umm there are like 60,000 lit tests checked into LLVM that verify that the output is absolutely a deterministic function of the input of at least that compiler.
> Even though the source code had the same bytes, the output of the compiler was wildly different.
This is the goofiest I've seen written unironically in quite a long - the C preprocessor is not part of the compiler. The pre in preprocessor should probably give it away.
Just a tip: you should probably actually understand something before you decide you hate it.
> This is the goofiest I've seen written unironically in quite a long - the C preprocessor is not part of the compiler. The pre in preprocessor should probably give it away.
This is true but doesn't seem relevant; does replacing the word "compiler" with "build chain" change anything? Because that seems like the clear meaning.
On the off chance that you’re serious, that would result in disastrously bad output. The difference between “jmp $+15” and “jmp $+16” is inscrutable and the LLM would not be able to pick the right one without tooling.
That tooling is a compiler. The higher level, the better chance the LLM can be steered to good output. Machine code is hopeless, don’t bother.
Generative algorithms have been studied for decades now and while they have led to some interesting results they're a bad fit for LLMs because there's no such thing as a "plausible" binary: a small perturbation yields an unusable result.
Nix also needs the build output to be deterministic to calculate the hash. It also has the problems of timestamps etc. The build environment tries to be hermetic by setting the time to be epoch among other things.
These seem very reasonable, the workarounds used are natural, and overall the article is not at all congruous with the conclusion in the (clickbait?) title?
Compilers literally made your project possible!
Reading this, I think low level engineering is actually more dependent on specific environments. Hardware also has its own points of change. Usually, when you think at a high level, environmental changes are less significant than you might expect. But low level thinking tends to be tied to specific environments, which is what makes it difficult. The reason low level is hard is that even if the code itself is short, the hidden assumptions inside it are difficult and place a heavy cognitive load on the programmer. For example, even a short snippet in C like `int value = (int)buffer` requires a lot of implicit knowledge about the 4 byte alignment of the buffer, or whether int is exactly 32 bits. LLMs do not seem to be very good at knowing these things. Rather, they are strong at high level wrapping, but at the low level, they seem surprisingly difficult and somewhat useless. Hardware has CPU generation changes, and in the case of PLCs, where I mainly work, the protocol differences between vendors are far too severe. There does not seem to be any technology with a very long lifecycle.
Depends on what you mean by low level I guess. Compared to web application framework churn rate, simple procedural programming without many dependencies is remarkably stable. You tend to program in a way that works for most platforms (all targetted platforms). How to best do that you learn over the years. To me personally it's very refreshing if the environment around you does not constantly change. That affords learning a bag of tricks and a list of gotchas to avoid.
Time date env variables and random address... Is also input data, maybe not as a flag but still
Time and date are... tolerable. There's SOURCE_DATE_EPOCH which should always be set to whack it into submission when used. ASLR of the _compiler being invoked_ resulting in a difference in the _program being compiled_ is nuts and would break any self-hosting compiler with consistency checks.
Explicit is Better than Implicit.
The Birth and Death of Javascript really had the gift of prophecy, eh
To avoid all those grotesque and absurd compilers and runtimes, more for those of computer languages with a ultra-complex syntax (c++ and similar), I now design "binary specifications" which I "design" and "validate" with RISC-V assembly coding.
Here, since any whatwg cartel web engine is an issue, the author should not bother.
If Clang generated non-deterministic output due to pointer addresses then that's a bug (happens regularly) that should be fixed. The most common way this happens if it some code path is iterating over a DenseMap which is non-deterministic. Sometimes that's fine and sometimes that's not depending on how that map is used. The common way to fix that is to switch to a MapVector which pays some additional runtime/memory cost to guarantee deterministic iteration order.
I'll try and make a minimal reproduction case and file a bug. Do you know if any tooling that can take a binary and fuzz it down to a minimal reproduction set?
> lol you'd think, but no, it's not. In theory it is (and for small scale compilers it definitely is), but in practice compilers are strange and complicated beasts containing multitudes that no mere mortal can fully comprehend on their own.
Umm there are like 60,000 lit tests checked into LLVM that verify that the output is absolutely a deterministic function of the input of at least that compiler.
> Even though the source code had the same bytes, the output of the compiler was wildly different.
This is the goofiest I've seen written unironically in quite a long - the C preprocessor is not part of the compiler. The pre in preprocessor should probably give it away.
Just a tip: you should probably actually understand something before you decide you hate it.
Tell me the flags I'm missing then: https://github.com/TecharoHQ/anubis/blob/Xe/wasm4/utils/wasm...
> This is the goofiest I've seen written unironically in quite a long - the C preprocessor is not part of the compiler. The pre in preprocessor should probably give it away.
This is true but doesn't seem relevant; does replacing the word "compiler" with "build chain" change anything? Because that seems like the clear meaning.
Re: source code producing different binaries: things like ASLR, stack canaries, optimization levels, linking, etc all lead to different binaries.
As long as the program is equivalent there isn't an actual problem here. Requiring the output to always be the same is an arbitrary restriction.
If you want to have users trust that someone else hasn't modified it, then sign it with your identity.
We'd like to verify, not trust.
LLMs should be trained on and directly output binary.
On the off chance that you’re serious, that would result in disastrously bad output. The difference between “jmp $+15” and “jmp $+16” is inscrutable and the LLM would not be able to pick the right one without tooling.
That tooling is a compiler. The higher level, the better chance the LLM can be steered to good output. Machine code is hopeless, don’t bother.
Generative algorithms have been studied for decades now and while they have led to some interesting results they're a bad fit for LLMs because there's no such thing as a "plausible" binary: a small perturbation yields an unusable result.
It should not. Abstraction in software engineering brings intelligence. (compression correlates to intelligence)
You should go back to elementary school
Technically they are, just a subset. But still a practical one, they're frequently used to produce executable files.
What other recent Musk talking points do you like to repeat?
I think you forgot the "/s"