Language Models vs. Image models

by Helena

One counterintuitive thing about language vs. image models is that language models are orders of magnitude larger than image models. Outside of models, the opposite holds: images are bigger files than text, and text is the easiest thing to store. For large models, the relationship flips.

An LLM easily runs to billions of parameters, while a diffusion image model usually tops out at a few hundred million. I could run the earliest Stable Diffusion locally on a random PC with a GPU that could not run Cyberpunk 2077. Yet my 3080 gaming PC does not have enough VRAM to even give today's LLMs a go.
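To make the scale gap concrete, here is a minimal back-of-the-envelope sketch. The parameter counts are ballpark assumptions (roughly 7 billion for a small modern LLM, roughly 860 million for the Stable Diffusion 1.x UNet), and it counts only the memory needed to hold the weights in fp16, ignoring activations and caches:

```python
def weight_vram_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights alone, at fp16 (2 bytes per parameter)."""
    return n_params * bytes_per_param / 1024**3

# Ballpark parameter counts (assumptions, not exact figures):
llm_gb = weight_vram_gb(7e9)   # ~13 GB: already above a 10 GB RTX 3080
sd_gb = weight_vram_gb(8.6e8)  # ~1.6 GB: fits on a modest gaming GPU

print(f"7B LLM weights:  {llm_gb:.1f} GB")
print(f"SD 1.x UNet:     {sd_gb:.1f} GB")
```

Even before activations and context caches, the weights alone put a mid-size LLM out of reach of a consumer card that handles image generation comfortably.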

The patterns in text are infinitely intricate. The way I think about it, text encodes human civilization. LLMs are trained mostly on English, but people are expanding their multilingual capabilities - and these languages include not only the ones spoken by humans, but also those spoken by computers (code). So this is the ultimate bridge, or rather a merger. I don't see any other way human-computer interaction will truly reach its end game.

Images represent a simpler form of language: we can communicate through these visual renderings regardless of the languages we speak. The models work on pixels, arranged neatly in a square grid - 512x512, 1024x1024, 2K, 4K, 8K, all the same neatness. These arrays get really fun and creative.
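That neatness is literal: an image is just a fixed-shape array of numbers. A minimal sketch of what one of those 512x512 grids looks like in memory (raw uint8 RGB pixels, the standard representation before any model-specific encoding):

```python
import numpy as np

# A 512x512 RGB image: height x width x channels, one byte per channel.
img = np.zeros((512, 512, 3), dtype=np.uint8)

# The whole grid is 512 * 512 * 3 = 786,432 bytes, i.e. 0.75 MB raw.
print(img.shape)                 # (512, 512, 3)
print(img.nbytes / 1024**2)      # 0.75
```

A fixed, regular shape like this is part of why image models stay comparatively small: the structure of the input never changes, only its contents.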

So who’s RAGing images?
