The first programs were written in binary/hexadecimal, and only later did we invent programming languages and compilers to translate human-readable code into binary machine code.

So why can’t we just do the same thing in reverse? I hear a lot about devices, from audio streamers to footwear, rendered useless by abandonware. Couldn’t a very smart person (or AI) just take the existing program and turn it back into source code?

    • mesamune@lemmy.world
      ↑ 27 · ↓ 1 · 2 months ago

      It’s taken years for devs to decompile Zelda let alone other projects. It’s crazy how much work goes into such projects.

  • Hildegarde@lemmy.world
    ↑ 4 · 2 months ago

    Computer code is very complicated, so when humans write code we write in a way we can understand. We name functions and variables with names that make sense, and we put comments in the code so we can understand how it works.

    Compilers don’t care about any of those things. Variable names are turned into numbers (memory addresses and register slots), and comments are discarded.

    You can convert machine code back to source code, but it will be missing all those human-readable labels and explanations. You can recreate them, but it’s a major undertaking. Reverse engineering is done sometimes, but there’s a reason it’s not common.
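To make the point about lost names and comments concrete, here is a minimal Python sketch. It is only an analogy: it uses Python's `ast` module (Python 3.9+ for `ast.unparse`) rather than a real native-code compiler, and the `monthly_payment` function, the `StripNames` class and the `sub_0`/`v1` placeholder scheme are all invented for illustration. It drops the comment and renames everything, which is roughly what decompiler output tends to look like.

```python
import ast

SOURCE = '''
def monthly_payment(principal, annual_rate, months):
    # Standard amortization formula; the yearly rate becomes a monthly rate.
    monthly_rate = annual_rate / 12
    growth = (1 + monthly_rate) ** months
    return principal * monthly_rate * growth / (growth - 1)
'''


class StripNames(ast.NodeTransformer):
    """Replace every function, argument and variable name with a placeholder."""

    def __init__(self):
        self.mapping = {}  # original name -> placeholder

    def _placeholder(self, name, prefix="v"):
        if name not in self.mapping:
            self.mapping[name] = f"{prefix}{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._placeholder(node.name, prefix="sub_")
        self.generic_visit(node)  # continue into arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._placeholder(node.id)
        return node


def strip(source):
    """Parse the source (comments are dropped here) and rename all identifiers."""
    tree = StripNames().visit(ast.parse(source))
    return ast.unparse(tree)


stripped = strip(SOURCE)
print(stripped)
```

The stripped version still computes the same thing, but nothing in it says "loan payment" any more. Recovering that meaning is the manual part of reverse engineering.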

    There’s also the issue of licensing. An important part of free and/or open source software is that you have permission to modify the source code. You probably don’t have a license to use the code if it’s closed source. There are ways to do this legally, but they add extra hurdles and inconvenience to an already major process.

  • Contramuffin@lemmy.world
    ↑ 6 · 2 months ago

    Yes, and people do do it. It’s just incredibly difficult even for relatively simple programs, and the difficulty climbs steeply as the program gets more complex.

    The problem is not necessarily turning it into code, since many decompilers already do that for you nowadays. The issue is understanding what in the world the code is supposed to do. Normally, open source code is commented and documented, so it’s easy to edit or build on. Decompiled code comes with no documentation or comments, and the variable names are auto-generated placeholders that tell you nothing.

    It’s sometimes easier to build something new than to fix what’s broken, and this is one of those cases.

  • Norgur@fedia.io
    ↑ 58 · 2 months ago

    Imagine being presented with an aircraft. You bloody well know what it does and you get permission to disassemble the whole thing to your heart’s content. How big of a task do you think it’d still be to be able to work out how the winged metal tube works and why it does what it does when it does it?

    Exactly.

  • gaiussabinus@lemmy.world
    ↑ 3 · 2 months ago

    It’s not. I believe lowlevellearning has a tutorial on tearing down binaries. If not him, John Hammond does for sure. Both are on YouTube. That skill set is usually employed in security research, since it pays more than reverse engineering old software with problematic licensing and uncertain ownership.

  • FaceDeer@fedia.io
    ↑ 11 · 2 months ago

    As others have mentioned, it’s possible but very complicated. Decompilers produce code that isn’t very readable for humans.

    I am indeed awaiting the big news headlines that will for some reason catch everyone by surprise when an LLM comes along that’s trained to “translate” machine code into a nice, easily comprehensible high-level programming language. It’s going to be a really big development: even though it won’t make programs legally “open source”, it’ll make them all source-available.

    • BigDanishGuy@sh.itjust.works
      ↑ 4 · 2 months ago

      I am indeed awaiting the big news headlines that will for some reason catch everyone by surprise when an LLM comes along that’s trained to “translate” machine code into a nice, easily comprehensible high-level programming language.

      Another commenter dismissed the idea outright. WTF… What is implausible about an LLM that takes decompiled code, deals with the obfuscating BS, recognizes known libraries, and organizes the remaining code? That will totally happen, if it hasn’t already been done.

      • FaceDeer@fedia.io
        ↑ 3 · 2 months ago

        There’s a lot of outright rejection of the possibilities of AI these days, I think because it’s turning out to be so capable. People are getting frightened of it and so jump to denial as a coping mechanism.

        I recalled reading about an LLM, developed just a couple of weeks ago, for translating source code into intermediate representations (a step along the way to full compilation). When I went hunting for a reference to refresh my memory, I found this article from March about exactly what’s being discussed here: an LLM that translates assembly language into high-level source code. It looks like this one is just a proof of concept rather than something highly practical, but prove the concept it does.

        I wonder if there are research teams out there sitting on more advanced models right now, fretting about how big a bombshell it’ll be when this gets out.

      • orcrist@lemm.ee
        ↑ 2 · 2 months ago

        It’s easy to say that we should throw AI at a problem and in a few years it will solve it, but most of the time it doesn’t actually work that way. If you think about the Turing Test itself, where the history goes back to the 1950s, how many decades did it take for us to get to anything that could reasonably come close to passing it? So anytime you think to yourself that one of these days AI is going to get there, remember that one of these days might actually be a half century from now.

        The other aspect, specific to this challenge, is the setup itself: humans organize code according to reasoning that only the authors know, that organization gets compiled away, and then another program has to recover what the original authors were thinking when they designed the thing. That’s a steep hill to climb. Can it be done on a small scale? It certainly can. On a large scale? Don’t hold your breath.

    • Toes♀@ani.social
      ↑ 5 · 2 months ago

      I have a bunch of 16-bit applications that I would love to be able to do that with. Mostly DOS and Windows 3.1 games.

      • subignition@fedia.io
        ↑ 4 · 2 months ago

        You might actually consider dipping your toes into trying to learn how to analyze/reverse those yourself. Relatively speaking, software that old can sometimes be easier to reverse.

        • Toes♀@ani.social
          ↑ 2 · 2 months ago

          Yeah, I’m not unfamiliar with the process (still a novice though) and have mostly used it to circumvent something obnoxious or tweak save files. It just takes a lot of effort when you’re only looking to spend a couple of hours playing a game before bed.

          I’m currently experiencing a frustrating bug in Dolphin and I’m tempted to learn enough to track it down. My MIPS buddy won’t help me with it because he thinks it’s a waste of time.

          I like LLMs for the time they save you on something laborious or mundane. One day we’ll have general AI, fingers crossed.

          ~Love the toes pun

          • subignition@fedia.io
            ↑ 2 · 2 months ago

            My apologies for preaching to the choir. (And I didn’t notice your username when I wrote that, LOL. Happy accident.)

  • hendrik@palaver.p3x.de
    ↑ 11 · 2 months ago

    To add to the point about “the first programs” written in assembler: you’ll find they have some structure to them, because they were written by humans. You can recognize the conditions and loops, and similar code is grouped together. A compiler does none of that. It’ll happily make a complete mess: reorganize the code, combine chunks, and do away with loops and other constructs if the job can be done more efficiently another way. The result is (ideally) more optimal, but what happens in that code is generally unrecognizable to the human eye. It might be one big pile of instructions, jumping around arbitrarily, without any subdivision into chunks that logically belong together.
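A very small, hedged illustration of a compiler rewriting what the human wrote, using CPython's own bytecode compiler as a stand-in for a native optimizing compiler (which goes much further). The variable name `seconds_per_day` is just an example. On the CPython releases I am aware of, the multiplication is carried out at compile time, so the "60 * 60 * 24" the author typed does not exist in the compiled code at all.

```python
import dis

source = "seconds_per_day = 60 * 60 * 24"
code = compile(source, "<example>", "exec")

# Constants stored in the compiled code object. When the compiler folds
# constant expressions, the finished product shows up here.
print(code.co_consts)

# Instruction listing: compare it with the three-factor expression above.
dis.dis(code)

# True when the precomputed product is among the stored constants.
folded = (60 * 60 * 24) in code.co_consts
print("folded at compile time:", folded)
```

Someone reading only the compiled form sees one number being stored. Whether the author wrote it as a product, a named constant, or a literal is no longer recoverable from the output.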

  • xmunk@sh.itjust.works
    ↑ 3 · ↓ 1 · 2 months ago

    Assuming you have all the source code… it is possible. It’s usually a huge pain in the ass though and software is so complicated that it’s extremely difficult to get anything useful.

  • foggy@lemmy.world
    ↑ 75 · ↓ 3 · edited · 2 months ago

    It is not. idk who told you it was.

    Disassembling an executable is trivial to do. Everything is open source if you can read assembly. Obfuscation be damned.
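As a rough feel for how mechanical the disassembly step is, here is a sketch using Python's standard `dis` module. Treat it as an analogy only: this is Python bytecode, not native machine code, and bytecode keeps more metadata (such as argument names) than a stripped native executable would. The `check` function is a made-up example.

```python
import dis


def check(a, b):
    return (a * 31 + b) % 65521


# One call turns the compiled function back into a readable instruction list.
instructions = list(dis.get_instructions(check))
for ins in instructions:
    print(ins.offset, ins.opname, ins.argrepr)

opnames = [ins.opname for ins in instructions]
```

Getting the listing takes a single call. Working out why the code multiplies by 31 and reduces modulo 65521 is the part no disassembler does for you.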

    • Lemminary@lemmy.world
      ↑ 14 · 2 months ago

      I’ve used a decompiler to peek at the source code of an app written in Visual Basic I wanted to recreate as a browser addon. It was mostly successful but some variable and function names were messed up.

      • peopleproblems@lemmy.world
        ↑ 29 · 2 months ago

        Variable names, class names, package structure, method names, etc. normally won’t survive in the disassembled code. They are meaningless to the CPU, which only sees a series of memory addresses. Where you do see method names, it’s likely a call into an existing external library (a library call rather than a true syscall), which has to be referenced by name. I’m not familiar with VB, but in .NET and .NET Framework, for example, System.Collections.Generic provides the implementation of List<string>, and when .Sort() is called, the call goes into that compiled .dll.

          • peopleproblems@lemmy.world
            ↑ 24 · 2 months ago

            Instead of just getting the downvotes, I’ll explain why that wouldn’t work.

            1. The AI itself cannot decompile it without the same tools I would use. The AI would then end up at the same starting point I have.
            2. Current LLMs do not know how to interpret code logic, and would likely make mistakes in syscalls, register addresses, and instructions.
            3. Assembly languages themselves have nothing beyond instruction sets. I’m sure there are ways to organize the code in the super rare case of actually writing assembly, but not to the extent of object-oriented or functional programming.

            Lastly, other comments have pointed out that decompiled code is extremely expensive to analyze. The output from whatever we decompile would easily exceed the input limits of all existing LLMs.

            • Naich@lemmings.world
              ↑ 2 · ↓ 2 · 2 months ago

              Thanks. I was thinking that you could have an AI “looking over the shoulder” of a compiler, seeing what comes out for the code going in to it. Basically training it to spot sequences in compiled code in order to guess the instructions that compiled into that code.
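The "looking over the compiler's shoulder" idea amounts to collecting pairs of (compiled output, original source) to learn from. A toy sketch of building such pairs, assuming Python's built-in `compile` and `dis` as the compiler and the `SNIPPETS` list as made-up training inputs. No model is trained here, and real work on native code would need a native toolchain and far more data.

```python
import dis

# Hypothetical example inputs; a real dataset would be vastly larger.
SNIPPETS = [
    "total = price * quantity",
    "ratio = hits / (hits + misses)",
    "flag = not done",
]


def opcode_sequence(source):
    """Compile one snippet and return the names of its bytecode instructions."""
    code = compile(source, "<snippet>", "exec")
    return tuple(ins.opname for ins in dis.get_instructions(code))


# Each pair maps "what came out of the compiler" to "what went in".
pairs = [(opcode_sequence(src), src) for src in SNIPPETS]

for opcodes, src in pairs:
    print(src, "->", " ".join(opcodes))
```

A model trained on pairs like these could only guess a plausible source for a given instruction sequence, since many different sources produce the same output.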

    • LavenderDay3544@lemmy.world
      ↑ 17 · ↓ 1 · 2 months ago

      The hard part isn’t reading assembly. The hard part is figuring out why it’s doing what it’s doing with no comments or function names or anything useful to help.

      This is like saying if you can read English you can understand an advanced math or physics paper written in English without having any knowledge or context of those subjects.

    • Thorry84@feddit.nl
      ↑ 56 · 2 months ago

      Well, decompiling is only one step in the reverse engineering process. I would recommend taking a look at the Legend of Zelda: Ocarina of Time decompilation projects. They reverse engineered the whole thing, which took years and was a team effort.

      In the end they got perfectly readable source code, fully documented. And the most amazing thing is, when compiled with the right compiler and right flags, it recreates the original rom perfectly.
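"Recreates the original ROM perfectly" can be checked mechanically: build the reconstructed source and compare the result byte for byte with the original, usually via a hash. A minimal sketch of that check follows. The function names and file names are hypothetical and this is not claimed to be any project's actual tooling.

```python
import hashlib
from pathlib import Path


def sha1_of(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_matching_build(original_rom, rebuilt_rom):
    """True when the rebuilt file has the same SHA-1 as the original."""
    return sha1_of(original_rom) == sha1_of(rebuilt_rom)


# Example call with hypothetical file names:
# is_matching_build("original.z64", "build/rebuilt.z64")
```

A matching hash shows the reconstructed source compiles to the same bytes. It does not show that the recovered names and comments match what the original developers wrote.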

      I would also recommend a YouTuber called Kaze. He’s been working on Mario 64 for years, re-writing large parts of the engine to get some pretty cool stuff going.

  • 2484345508@lemy.lol
    ↑ 7 · 2 months ago

    In addition to the other comments that explained it well… Back in the day, that process was easier in part because executable files had far fewer instructions.

  • ℕ𝕖𝕞𝕠@midwest.social
    ↑ 10 · 2 months ago

    We can and have done this, but there’s not much gain, which is why it’s mostly done by hobbyists to their favorite older software whose parent company went bust. It’s especially common for older games.

  • Björn Tantau@swg-empire.de
    ↑ 4 · 2 months ago

    It’s not impossible. It’s being done all the time. It’s just tedious, complicated work. So if you don’t have someone willing to invest their time and expertise, it won’t be done for most stuff.

  • ImplyingImplications@lemmy.ca
    ↑ 9 · 2 months ago

    It’s always possible to convert binary (machine code) into assembly since they’re basically the same thing. Assembly is just human readable binary.

    It’s not possible to recover the original high-level source code from assembly, the same way you can’t recover the recipe for a cake by analyzing the composition of the cake. A decompiler can produce some source that behaves the same, but not the one the author wrote. It doesn’t matter how smart you are: the temperature the oven was set at and the time the cake spent in the oven can’t be found in the molecular structure of a cake.
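The cake analogy is a statement about a many-to-one mapping: different source texts can compile to exactly the same instructions, so the output cannot tell you which one was the input. A small sketch using Python's built-in `compile` as a stand-in for a real compiler, with the `VARIANTS` strings made up for the example:

```python
# Three different source texts: extra parentheses, a comment, odd spacing.
VARIANTS = [
    "total = price * qty",
    "total = (price * qty)   # line total before tax",
    "total    =    price*qty",
]

compiled = [compile(src, "<variant>", "exec") for src in VARIANTS]

# co_code holds the raw bytecode; a set collapses identical byte strings.
bytecodes = {code.co_code for code in compiled}

print(len(VARIANTS), "source texts ->", len(bytecodes), "distinct bytecode sequence(s)")
```

If every variant collapses to one bytecode sequence, then going backwards means picking one plausible source out of many, and the comment is gone in every case.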