Bill proposed to outlaw downloading Chinese AI models.

schizoidman@lemm.ee · 2 months ago

thingsiplay@beehaw.org · 2 months ago

None of the code and training data is available. It's just the usual Hugging Face thing, where some weights and parameters are available, nothing else. People keep repeating that DeepSeek (and many other) LLMs are open source, but they aren’t.

They even have a GitHub source code repository at https://github.com/deepseek-ai/DeepSeek-R1 , but it's only an image, a PDF file, and links to download the model on Hugging Face (plus optional weights and parameter files to fine-tune it). There is no source code and no training data available. Also, here is an interesting article about this issue: Liesenfeld, Andreas, and Mark Dingemanse. “Rethinking open source generative AI: open washing and the EU AI Act.” The 2024 ACM Conference on Fairness, Accountability, and Transparency. 2024.

Gamers_mate@beehaw.org · 2 months ago

Damn, that sucks. It should be open source. Let people fork and optimize it so it uses as little electricity as possible.

P03 Locke@lemmy.dbzer0.com · edit-2 2 months ago

This literally took one click: https://github.com/deepseek-ai

Stop spreading FUD.

thingsiplay@beehaw.org · 2 months ago

Can you actually explain what in my reply is “Fear, uncertainty, and doubt”? Did you actually read it? I even linked to the specific GitHub repository, which is basically empty. You just linked to an overview, which does not point to any source code.

Please explain what the FUD is and link to the source code; otherwise, do not accuse people of spreading FUD if you don’t know what you are talking about.

P03 Locke@lemmy.dbzer0.com · 2 months ago

You’re purposely being obtuse, and not arguing in good faith. The source code is right there, in the other repos owned by the deepseek-ai user.

thingsiplay@beehaw.org · 2 months ago

What are you talking about? What bad faith are you accusing me of? I asked you to show me the repository that contains the source code. There is none. Please give me a link to the repo you have in mind. Where are the source code and training data of DeepSeek-R1? Can we build the model from source?

jarfil@beehaw.org · 2 months ago

Where’s the training data?

Crotaro@beehaw.org · 2 months ago

Does open sourcing require you to give out the training data? I thought it only means allowing access to the source code so that you could build it yourself and feed it your own training data.

jarfil@beehaw.org · 2 months ago

Open source requires giving whatever digital information is necessary to build a binary.

In this case, the “binary” is the set of network weights, and “whatever is necessary” includes both the training data and the training code.

DeepSeek is sharing:

NO training data
NO training code
instead, PDFs with a description of the process
binary weights (a few snapshots)
fine-tune code
inference code
evaluation code
integration code

In other words: a good amount of open source… with a huge binary blob in the middle.
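As a rough illustration of that "can you rebuild the binary?" test, here is a minimal sketch. The `Artifact` class and the function names are hypothetical, made up for this example, and the released/not-released flags simply mirror the list above rather than any independent audit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """One piece of a model release (hypothetical structure for illustration)."""
    name: str
    released: bool
    needed_to_rebuild: bool


# Status as listed in the comment above; not independently verified.
ARTIFACTS = [
    Artifact("training data", released=False, needed_to_rebuild=True),
    Artifact("training code", released=False, needed_to_rebuild=True),
    Artifact("process description (PDF)", released=True, needed_to_rebuild=False),
    Artifact("weights", released=True, needed_to_rebuild=False),
    Artifact("fine-tune code", released=True, needed_to_rebuild=False),
    Artifact("inference code", released=True, needed_to_rebuild=False),
    Artifact("evaluation code", released=True, needed_to_rebuild=False),
    Artifact("integration code", released=True, needed_to_rebuild=False),
]


def missing_for_rebuild(artifacts):
    """Return the names of inputs required to rebuild the weights but not released."""
    return [a.name for a in artifacts if a.needed_to_rebuild and not a.released]


def is_buildable_from_source(artifacts):
    """True only if every input needed to rebuild the weights has been released."""
    return not missing_for_rebuild(artifacts)


print(missing_for_rebuild(ARTIFACTS))
print(is_buildable_from_source(ARTIFACTS))
```

Under this definition, the release counts as "buildable from source" only once the list of missing inputs is empty.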

Crotaro@beehaw.org · 2 months ago

Thanks for the explanation. I don’t understand enough about large language models to give a valuable judgement on this whole Deepseek happening from a technical standpoint. I think it’s excellent to have competition on the market and it feels that the US’ whole “But they’re spying on you and being a national security risk” is a hypocritical outcry when Facebook, OpenAI and the like still exist.

What do you think about DeepSeek? If I understood correctly, it’s being trained on the output of other LLMs, which makes it much cheaper but, to me it seems, also even less trustworthy, because now all the actual human training data is missing and instead it’s a bunch of hallucinations, lies and (hopefully more often than not) correctly guessed answers to questions asked by humans.

jarfil@beehaw.org · edit-2 2 months ago

There are several parts to the “spying” risk:

Sending private data to a third party server for the model to process it… well, you just sent it, game over. Use local models, or machines (hopefully) under your control, or ones you trust (AWS? Azure? GCP?.. maybe).

All LLMs are black boxes; the only way to make an educated guess about their risk is to compare the training data and procedure to the evaluation data of the final model. There is still a risk of hallucinations and deception, but it can be quantified to some degree.

DeepSeek uses a “Mixture of Experts” approach to reduce computational load… which is great, as long as you trust the “Experts” they use. Since the LLM that was released for free is still a black box, and there is no way to verify which “Experts” were used to train it, there is also no way to know whether some of those “Experts” might or might not be trained to behave in a malicious way under some specific conditions. It could just as easily be a Trojan horse with little chance of getting detected until it’s too late.

it’s being trained on the output of other LLMs, which makes it much cheaper but, to me it seems, also even less trustworthy

The feedback degradation of an LLM happens when it gets fed its own output as part of the training data. We don’t exactly know what training data was used for DeepSeek, but as long as it was generated by some different LLM, there would be little risk of a feedback reinforcement loop.

Generally speaking, I would run the DeepSeek LLM in an isolated environment, but not trust it to be integrated into any sort of non-sandboxed agent. The downloadable smartphone app is possibly “safe” as long as you restrict the hell out of it, don’t let it access anything on its own, and don’t feed it anything remotely sensitive.

Crotaro@beehaw.org · 2 months ago

Thank you a lot for the load of information! I just now got to reading it all. I was very skeptical about the fact that it is fed by the output of other LLMs but the way you explain it makes sense to me that it might not be that much of a problem. I guess a super blunt analogy could be “It’s only incest if it’s your children” lol

teawrecks@sopuli.xyz · 2 months ago

Is there any good LLM that fits this definition of open source, then? I thought the “training data” for good AI was always just: the entire internet, and they were all ethically dubious that way.

What is the concern with only having weights? It’s not arbitrary code execution, so there’s none of the security risk or loss of control over your own computing that open source is usually meant to address in the first place.

To me the weights are less of a “blob” and more like an approximate solution to an NP-hard problem. Training is traversing the search space, and sharing a model is just saying “hey, this point looks useful, others should check it out”. But maybe that is a blob, since I don’t know how they got there.
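A toy sketch of that "traverse the search space, then share the point" picture, assuming a one-parameter model and a made-up loss function (real training works on billions of parameters, and nothing here reflects how DeepSeek was actually trained):

```python
def loss(w):
    """Toy loss with its minimum at w = 3 (an arbitrary choice for illustration)."""
    return (w - 3.0) ** 2


def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


def train(w_start=0.0, learning_rate=0.1, steps=100):
    """Plain gradient descent: walk the search space downhill from w_start."""
    w = w_start
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


# The "released weights" are just the point where the search ended up.
shared_weight = train()
print(shared_weight, loss(shared_weight))
```

Anyone handed `shared_weight` can evaluate `loss` at that point, but the number alone says nothing about the starting point, the step size, or the data that shaped the loss.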

jarfil@beehaw.org · 2 months ago

There are several “good” LLMs trained on open datasets like FineWeb, LAION, DataComp, etc. They are still “ethically dubious”, but at least they can be downloaded, analyzed, filtered, and so on. Unfortunately businesses are keeping datasets and training code as a competitive advantage, even "Open"AI stopped publishing them when they saw an opportunity to make money.

What is the concern with only having weights? It’s not arbitrary code execution

Unless one plugs it into an agent… which is kind of the use we expect right now.

Accessing the web, or even web searches, is already equivalent to arbitrary code execution: an LLM could decide to, for example, summarize and compress some context full of trade secrets, then proceed to “search” for it, sending it to wherever it has access to.

Agents can also be allowed to run local commands… again a use we kind of want now (“hey Google, open my alarms” on a smartphone).
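One partial mitigation is to put a guard between the agent and the network. The sketch below is only an illustration under stated assumptions: the allowlisted host, the length limit, and the function name are all hypothetical, and a length cap narrows the channel without closing it.

```python
from urllib.parse import urlparse

# Hypothetical policy values for illustration only.
ALLOWED_HOSTS = {"example.org"}
MAX_QUERY_CHARS = 200


def is_allowed_request(url, allowed_hosts=ALLOWED_HOSTS, max_query_chars=MAX_QUERY_CHARS):
    """Decide whether an agent-issued request may leave the sandbox.

    Rejects non-HTTPS URLs, hosts outside the allowlist, and query strings
    long enough to carry a sizeable chunk of summarized context.
    """
    parts = urlparse(url)
    if parts.scheme != "https":
        return False
    if parts.hostname not in allowed_hosts:
        return False
    return len(parts.query) <= max_query_chars


print(is_allowed_request("https://example.org/search?q=weather"))
print(is_allowed_request("https://attacker.example.net/?q=weather"))
```

Short queries to an allowlisted host still get through, so a check like this complements sandboxing rather than replacing it.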

teawrecks@sopuli.xyz · 2 months ago

Those security concerns seem completely unrelated to the model, though. You can have a completely open source model that fits all those requirements, and still give it too much unfettered access to important resources with no way of actually knowing what it will do until it tries.

jarfil@beehaw.org · 2 months ago

While unfettered access is bad in general, DeepSeek takes it a step further: the Mixture of Experts approach, used to reduce computational load, is great when you know exactly what “Experts” it’s using, and not so great when there is no way to check whether some of those “Experts” might be focused on extracting intelligence under specific circumstances.

P03 Locke@lemmy.dbzer0.com · 2 months ago

There are several “good” LLMs trained on open datasets like FineWeb, LAION, DataComp, etc.

Then use those as training data. You’re so caught up on this exacting definition of open source that you’ll completely ignore the benefits of what this model could provide.

an LLM could decide to, for example, summarize and compress some context full of trade secrets, then proceed to “search” for it, sending it to wherever it has access to.

That’s not how LLMs work, and you know it. A model of weights is not a lossless compression algorithm.

Also, if you’re giving an LLM free rein over all of your session tokens and security passwords, that’s on you.

jarfil@beehaw.org · 2 months ago

That’s not how LLMs work, and you know it. A model of weights is not a lossless compression algorithm.

https://www.piratewires.com/p/compression-prompts-gpt-hidden-dialects

if you’re giving an LLM free rein over all of your session tokens and security passwords, that’s on you.

There are more trade secrets than session tokens and security passwords. People want AI agents to summarize their local knowledge base and documents, then expand it with updated web searches. No passwords needed when the LLM can order the data to be exfiltrated directly.

P03 Locke@lemmy.dbzer0.com · 2 months ago

Nobody releases training data. It’s too large and varied. The best I’ve seen was the LAION-2B set that Stable Diffusion used, and that’s still just a big collection of links. Even that isn’t going to fit on a GitHub repo.

Besides, improving the model means using the model as a base and implementing new training data. Specialize, specialize, specialize.

thingsiplay@beehaw.org · 2 months ago

Nobody releases training data. It’s too large and varied.

That’s why it’s not open source. They do not release the source, and it’s impossible to build the model from source.

jarfil@beehaw.org · 2 months ago

What about these? Dozens of TB here:

https://huggingface.co/HuggingFaceFW

There is also a LAION-5B now, and several other datasets.

P03 Locke@lemmy.dbzer0.com · 2 months ago

Wow, it’s like you didn’t even read my post.