Interesting fact: OpenAI developed Whisper only because the company had already copied, analyzed, and used every usable text on the internet to train its LLM. With Whisper, it could transcribe the audio tracks of YouTube videos and use those transcripts for training as well.

indiatimes.com writes in “How tech giants cut corners to harvest data for AI”:

The artificial intelligence lab had exhausted every reservoir of reputable English-language text on the internet as it developed its latest AI system. It needed more data to train the next version of its technology – lots more.

My prediction: podcasts have probably already been transcribed and analyzed as well. Chats and data from assistants like Alexa or Siri could be next.

This text was automatically translated from German into English. Quotations that appeared in German were translated for meaning rather than word for word.