AI has become so invasively popular, and I’ve seen more evidence of its ineffectiveness than of the opposite. What I dislike most about it, though, is that many services run on datasets of stolen data for the sake of profitability, à la OpenAI and DeepSeek.

https://mashable.com/article/openai-chatgpt-class-action-lawsuit
https://petapixel.com/2025/01/30/openai-claims-deepseek-took-all-of-its-data-without-consent/

Are there any AI services that run on ethically obtained datasets, like stuff people explicitly consented to submitting (not as some side clause of a T&C), data bought by properly compensating the data’s original owners, or datasets contributed by the service providers themselves?

  • Pamasich@kbin.earth
    6 days ago

    Switzerland announced a new LLM project which might be of interest here.

    Here’s a German article on it. If you’re okay with a Reddit link, here’s a translation.

    Some points on it:

    • fully open source: source code, model weights, and training data will all be publicly released
    • licensed under Apache 2.0
    • compliant with Swiss data protection laws, copyright law, and the EU AI Act
    • respects crawler opt-outs on websites
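
    For anyone wondering what "respects crawler opt-outs" can look like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not taken from the Swiss project; the bot name `ExampleAIBot`, the helper `may_crawl`, and the sample robots.txt are all made up for illustration, and a real crawler may honor other opt-out signals as well.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url.

    may_crawl is a hypothetical helper, not part of any project's code.
    """
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines,
    # so no network request is needed for this example.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: the site opts out of one AI crawler
# ("ExampleAIBot" is an invented name) and allows everyone else.
EXAMPLE_ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    page = "https://example.com/post/1"
    print(may_crawl(EXAMPLE_ROBOTS, "ExampleAIBot", page))
    print(may_crawl(EXAMPLE_ROBOTS, "SomeOtherBot", page))
```

    A crawler that respects opt-outs would skip any page where this kind of check comes back false for its own user agent.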

    While nothing there explicitly says the data is ethically sourced, we’ll be able to tell from the open-source training data once it’s released. I assume compliance with copyright law takes care of stuff like books being used (though idk if the project has a way to determine the license of web content, or if it relies fully on opt-outs there).