• brucethemoose@lemmy.world
    12 hours ago

    Mistral likely does “prompt enhancement,” aka feeding your prompt to an LLM first and asking it to expand it with more words.

    So internally, a Mistral text LLM is probably writing out “Sure! Here’s a long prompt with no dog: …” and then only that expanded part is fed to the image generator, which never sees your original wording.

    Other “LLMs” are truly multimodal and generate the image output themselves, so the word “dog” from your prompt is still in their input.
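To make the difference concrete, here's a rough sketch of the two paths. Everything in it is made up for illustration: `fake_llm`, `fake_image_model`, and the instruction text are stand-ins, not Mistral's actual pipeline or anyone's real API. It only shows that with a rewrite step the image model gets the rewritten text, while the direct path still gets the word "dog".

```python
from typing import Callable

TextFn = Callable[[str], str]


def enhance_prompt(user_prompt: str, llm: TextFn) -> str:
    """Ask a text LLM to expand the user's request into a longer prompt."""
    instruction = (
        "Rewrite this image request as a detailed prompt. "
        "Describe only what should appear in the image.\n"
        f"Request: {user_prompt}"
    )
    return llm(instruction)


def generate_with_enhancement(user_prompt: str, llm: TextFn, image_model: TextFn) -> str:
    """Two-stage path: the image model only sees the LLM's rewrite."""
    expanded = enhance_prompt(user_prompt, llm)
    return image_model(expanded)


def generate_direct(user_prompt: str, image_model: TextFn) -> str:
    """Single-stage path: the user's own words reach the image model."""
    return image_model(user_prompt)


# Hypothetical stand-ins so the sketch runs without any real model.
def fake_llm(instruction: str) -> str:
    return "A quiet, empty park at sunset, wooden bench, fallen leaves"


def fake_image_model(prompt: str) -> str:
    return f"<image for: {prompt}>"


request = "a park with no dog"
enhanced_result = generate_with_enhancement(request, fake_llm, fake_image_model)
direct_result = generate_direct(request, fake_image_model)

print(enhanced_result)
print(direct_result)
```

In the first path the word "dog" is gone before the image model runs; in the second it is still sitting in the input, which is where the trouble with negations tends to come from.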