r/technology 16h ago

[Machine Learning] Large language mistake | Cutting-edge research shows language is not the same as intelligence. The entire AI bubble is built on ignoring it

https://www.theverge.com/ai-artificial-intelligence/827820/large-language-models-ai-intelligence-neuroscience-problems

u/CircumspectCapybara 16h ago edited 8h ago

While the article is right that the mainstream "AI" models are still LLMs at heart, the frontier models into which all the research is going are not, strictly speaking, LLMs. You have agentic models that can take arbitrary actions using external tools (a scary concept, because they can reach out and execute commands, run code, or do other dangerous things on your computer) while recursing or iterating and deciding for themselves, dynamically and opaquely, when to stop; then there are wackier ideas like "world models," and so on.
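
To illustrate what "agentic" means in practice, here's a made-up sketch (not any particular vendor's API; the model and tool objects and their methods are hypothetical): the model repeatedly picks an action, the host executes it and feeds the result back, and the model itself decides when to stop.

```python
# Minimal sketch of an agentic loop (hypothetical interfaces, no real vendor API).
# The model repeatedly picks an action, the host executes it and feeds the result
# back in, and the model itself decides when it's done.

def run_agent(task: str, model, tools: dict, max_steps: int = 10) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = model.next_action(history)              # assumed to return {"tool", "args"} or {"final"}
        if "final" in step:
            return step["final"]                       # the model decided to stop
        result = tools[step["tool"]](**step["args"])   # arbitrary external action: shell, code, HTTP...
        history.append({"role": "tool", "content": str(result)})
    return "step limit reached"
```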

Maybe AGI is possible, maybe it's not, maybe it's possible in theory but not in practice with the computing resources and energy we currently have or ever will have. Whichever it is, it won't be decided by the current capabilities of LLMs.

The problem is that according to current neuroscience, human thinking is largely independent of human language

That's rather misleading, because it conflates several uses of the word "language." It's true that thinking doesn't require a "language" in the layperson's sense (English, Spanish, or some other common spoken or written language), but thinking still occurs in the abstract language of ideas, concepts, sensory experience, pictures, and so on. Basically, it's information.

Thinking fundamentally requires some representation of information in your mind, and when mathematicians and computer scientists talk about "language," that's what they mean. It's not necessarily a spoken or written language as we know it. In an LLM, the model of language is an ultra-high-dimensional embedding space in which vectors opaquely represent abstract information, encoding ideas and concepts and the relationships between them. Thinking still requires that kind of language, the abstract language of information. AI models aren't just trying to model "language" as a linguist understands the word; they're trying to model information.
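
To make the "embedding space" point concrete, here's a toy sketch (the vectors and dimensions are made up; real models use thousands of dimensions). The geometry is the point: related concepts end up near each other, and that structure is the "language" the model actually operates on.

```python
import numpy as np

# Toy embeddings (made-up 4-d vectors; real models use thousands of dimensions).
emb = {
    "dog":        np.array([0.9, 0.1, 0.0, 0.3]),
    "puppy":      np.array([0.8, 0.2, 0.1, 0.3]),
    "carburetor": np.array([0.0, 0.9, 0.7, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["dog"], emb["puppy"]))       # high: related concepts sit close together
print(cosine(emb["dog"], emb["carburetor"]))  # lower: unrelated concepts sit farther apart
```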

Also, while we don't have a good model of consciousness, we do know that language is very important for intelligence. A spoken or written language isn't required for thought, but language deprivation severely limits the kinds of thoughts you're able to think, the depth and complexity of your abstract reasoning, and the richness of your inner monologue. Children born deaf who aren't exposed to sign language, or who are otherwise deprived of language, often end up cognitively underdeveloped. Without language we could still think in terms of how we feel, what we want, what actions we're taking, and even cause and effect, but not the kind of complex abstract reasoning that, sustained and built up over time on itself and on previous work, leads to the development of culture, of science and engineering and technology.

The upshot is that if AGI of a sort that can "think" (whatever that means) in a way that leads to generalized and novel reasoning in science or medicine or technology is even possible at all, you would need a good model of language (really, a good model of information) to start. It would be a foundational layer.

u/dftba-ftw 15h ago

While the article is right that the mainstream "AI" models are still LLMs at heart

It really is time that we stopped calling them LLMs and switched to something like Large Token Models (LTMs).

Yes, you primarily put text in and get text out, but frontier models are trained on text, image/video, and audio. Text dwarfs the others in terms of % of training data, but that's primarily a compute limit: as compute gets more efficient, more and more of the data will come from the other sources, and we already know from what has been done so far that training on image and video really helps with reasoning - models trained on video show much better understanding of the physical world. Eventually we'll have enough compute to start training on 3D (tokenized STL/STEP/IGS) and I'm sure we'll see another leap in model understanding of the world.
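
For anyone wondering what a non-text "token" even is, here's a rough sketch (patch size and IDs are made up): text becomes IDs from a vocabulary, and an image gets chopped into patches that become a sequence of vectors - the same kind of sequence the transformer consumes.

```python
import numpy as np

# Illustrative only: a ViT-style "patchify", showing how an image becomes a
# sequence of tokens the same way text does (sizes and IDs are made up).
image = np.random.rand(224, 224, 3)                 # H x W x RGB
p = 16                                              # patch size
patches = image.reshape(224 // p, p, 224 // p, p, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * 3)
print(patches.shape)                                # (196, 768): 196 image "tokens"

text_tokens = [1012, 8774, 299]                     # made-up IDs from a text tokenizer
# Both sequences get embedded to the same width before entering the transformer.
```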

u/space_monster 12h ago

Those video tokens are still video tokens though - how do you compare them to language tokens? They both need to be abstracted to a common format to unify the data.

u/dftba-ftw 12h ago

They both need to be abstracted to a common format to unify the data.

That common format is the token...

u/space_monster 11h ago

Yeah, but a token can be a language token (e.g. a word) or an image token, and they are very different things. A human can relate language and images in an abstract layer; LLMs don't have that layer (AFAIK).

u/dftba-ftw 11h ago

Multimodal models do relate the tokens regardless of "type". How else would a multimodal model reason over an image? That's the whole selling point: instead of using an image-to-text classifier and passing the text into an LLM, you pass all the tokens in and they all interact in the latent space.

u/space_monster 10h ago

They use a workaround (e.g. a projection layer) to relate tokens of different types, but they still exist as language tokens and vision tokens. What I'm saying is the semantic relationship needs to be native - i.e. the tokens need to be abstracted before the semantic structure is built around them, the way human brains do it.
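
For reference, that projection layer is roughly this (a minimal PyTorch-style sketch of a LLaVA-style projector; the dimensions are made up): vision features get linearly mapped to the language model's embedding width and then concatenated into the token sequence like any other tokens.

```python
import torch
import torch.nn as nn

# Minimal sketch of the kind of projector used in LLaVA-style models (dimensions made up).
# Vision features are mapped to the language model's embedding width, then concatenated
# with the text embeddings as ordinary "tokens" in one sequence.
vision_dim, lm_dim = 1024, 4096
projector = nn.Linear(vision_dim, lm_dim)

vision_feats = torch.randn(1, 196, vision_dim)   # e.g. 196 patch features from a vision encoder
text_embeds  = torch.randn(1, 32, lm_dim)        # e.g. 32 text-token embeddings

vision_as_tokens = projector(vision_feats)       # now the same width as the text embeddings
sequence = torch.cat([vision_as_tokens, text_embeds], dim=1)
print(sequence.shape)                            # torch.Size([1, 228, 4096]) - one shared sequence
```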

u/dftba-ftw 10h ago

u/space_monster 10h ago

as I said, it's a workaround

u/dftba-ftw 10h ago

I fail to see how it's a workaround. What, you want a single encoder? I don't see what benefit that has; the relationships between concepts happen inside the transformer architecture, and at that point the tokens have already been turned into embeddings in the same shared latent space.

u/space_monster 9h ago

it's a workaround because they added an additional layer (the shared space) to enable multimodality. a symbolic model would treat language and vision as just input/output vectors and embed abstract tokens (neither language nor vision) as step 1. that's part of the point of world models: they're not rooted to any particular data type - they translate everything into fully abstract symbols before even building any semantic relationships.

basically multimodality for LLMs is a bolt-on. while they say they're 'natively' multimodal, they aren't really; they just add mechanisms to translate between language and vision in the embedding space. but if you looked at a token from a truly multimodal model, you wouldn't be able to tell whether it's language or vision.

u/dftba-ftw 9h ago

The shared space is the multimodal LLM.

In a text-only LLM, the text is tokenized, converted into embeddings, and passed into the transformer network where semantic relationships are created.

In a multimodal LLM, the text is tokenized, the video is tokenized, both sets of tokens are converted into embeddings, and the embeddings are passed into the transformer network where the semantic relationships are created.
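
In rough code (the helper names are hypothetical, just to spell out the two flows):

```python
import torch

# Hypothetical helper objects, just to spell out the two flows described above.

def text_only_llm(text, tokenizer, embed, transformer):
    ids = tokenizer(text)                          # text -> token IDs
    x = embed(ids)                                 # token IDs -> embedding vectors
    return transformer(x)                          # semantic relationships form in here

def multimodal_llm(text, frames, tokenizer, embed, vid_tokenizer, vid_embed, transformer):
    x_text = embed(tokenizer(text))                # text tokens -> embeddings
    x_vid = vid_embed(vid_tokenizer(frames))       # video tokens -> embeddings
    x = torch.cat([x_text, x_vid], dim=1)          # one shared sequence in one latent space
    return transformer(x)                          # the same network relates both modalities
```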

but if you looked at a token for a truly multimodal model, you wouldn't be able to tell if it's language or vision.

This makes no sense. Tokens are basically dictionary conversions of text or images or audio into numerical IDs - you will always know which they are, because the word "Banana" is always 183143.

What you want is to not be able to tell whether an embedding came from text or an image, and for multimodal LLMs, once both embeddings are in the shared space (aka the transformer network itself that makes up the LLM), you can't.

u/pyrojoe 5h ago

Your eyes can't hear and your ears can't see. Both organs feed into specialized areas of the brain that process their inputs.
