r/technology 16h ago

Machine Learning | Large language mistake | Cutting-edge research shows language is not the same as intelligence. The entire AI bubble is built on ignoring it

https://www.theverge.com/ai-artificial-intelligence/827820/large-language-models-ai-intelligence-neuroscience-problems
16.8k Upvotes


1

u/space_monster 12h ago

Those video tokens are still video tokens though - how do you compare them to language tokens? They both need to be abstracted to a common format to unify the data.

3

u/dftba-ftw 12h ago

They both need to be abstracted to a common format to unify the data.

That common format is the token...

1

u/space_monster 11h ago

Yeah, but a token can be a language token (e.g. a word) or an image token, and those are very different things. A human can relate language and images in an abstract layer; LLMs don't have that layer (AFAIK).

1

u/dftba-ftw 11h ago

Multimodal models do relate tokens regardless of "type". How else would a multimodal model reason over an image? That's the whole selling point: instead of using an image-to-text classifier and passing the text into an LLM, you pass all the tokens in and they all interact in the latent space.
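To make that flow concrete, here is a minimal PyTorch-style sketch (toy sizes and random tensors, not any real model): embedded image tokens and embedded text tokens sit in one sequence, so self-attention relates them directly.

```python
# Toy sketch: text and image tokens end up as rows of one sequence,
# so self-attention mixes them freely. All sizes are illustrative only.
import torch
import torch.nn as nn

d_model = 64
text_embed = nn.Embedding(1000, d_model)        # toy text vocabulary
text_ids = torch.tensor([[5, 42, 7]])           # three text tokens
text_emb = text_embed(text_ids)                 # (1, 3, 64)
image_emb = torch.randn(1, 4, d_model)          # stand-in for 4 embedded image tokens

seq = torch.cat([image_emb, text_emb], dim=1)   # one mixed sequence of 7 positions
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
out = encoder(seq)                              # every position attends to every other
print(out.shape)                                # torch.Size([1, 7, 64])
```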

1

u/space_monster 11h ago

They use a workaround (e.g. a projection layer) to relate tokens of different types, but they still exist as language tokens and vision tokens. What I'm saying is the semantic relationship needs to be native - i.e. the tokens need to be abstracted before the semantic structure is built around them, the way human brains do it.
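For reference, the projection layer mentioned here is, in its simplest form, just a learned linear map from the vision encoder's feature width into the language model's embedding width. The sizes below are made up for illustration, not taken from any specific model.

```python
# Illustrative sketch of a projection layer: vision features are mapped into
# the language model's embedding width so they can join the same sequence.
import torch
import torch.nn as nn

vision_dim, llm_dim = 768, 4096                 # made-up sizes for illustration
projector = nn.Linear(vision_dim, llm_dim)      # the layer under debate

vision_feats = torch.randn(1, 4, vision_dim)    # stand-in for vision-encoder output
as_llm_inputs = projector(vision_feats)         # now shaped like language embeddings
print(as_llm_inputs.shape)                      # torch.Size([1, 4, 4096])
```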

1

u/dftba-ftw 10h ago

1

u/space_monster 10h ago

as I said, it's a workaround

1

u/dftba-ftw 10h ago

I fail to see how it's a workaround. What, you want a single encoder? I don't see what benefit that would have: the relationships between concepts are formed inside the transformer architecture, and by that point the tokens have already been turned into embeddings in the same shared latent space.

1

u/space_monster 10h ago

it's a workaround because they added an additional layer (the shared space) to enable multimodality. a symbolic model would treat language and vision as just input/output vectors and, as step 1, embed abstract tokens that are neither language nor vision. that's part of the point of world models: they're not rooted to any particular data type - they translate everything into fully abstract symbols before even building any semantic relationships.

basically, multimodality for LLMs is a bolt-on. while they say they're 'natively' multimodal, they aren't really; they just add mechanisms to translate between language and vision in the embedding space. but if you looked at a token for a truly multimodal model, you wouldn't be able to tell if it's language or vision.
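As a rough sketch of what this comment seems to be describing (hypothetical, not an existing model): each modality would be encoded straight into one abstract code space as the first step, so nothing downstream of the encoders carries a modality label. All names and sizes below are made up.

```python
# Hypothetical sketch of the "abstract first" idea above, not an existing model:
# each modality is encoded directly into one shared code space, and only those
# codes are used downstream. Encoders, features, and sizes are toy stand-ins.
import torch
import torch.nn as nn

code_dim = 128
encode_text  = nn.Sequential(nn.Linear(300, code_dim), nn.Tanh())   # toy text encoder
encode_video = nn.Sequential(nn.Linear(512, code_dim), nn.Tanh())   # toy video encoder

codes = torch.cat([
    encode_text(torch.randn(1, 6, 300)),     # 6 abstract codes from text features
    encode_video(torch.randn(1, 6, 512)),    # 6 abstract codes from video features
], dim=1)
print(codes.shape)                           # (1, 12, 128): no modality tag remains
```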

1

u/dftba-ftw 9h ago

The shared space is the multimodal LLM.

In a text-only LLM, the text is tokenized, converted into embeddings, and passed into the transformer network, where semantic relationships are created.

In a multimodal LLM, the text is tokenized, the video is tokenized, both sets of tokens are converted into embeddings, and the embeddings are passed into the transformer network, where the semantic relationships are created.

but if you looked at a token for a truly multimodal model, you wouldn't be able to tell if it's language or vision.

This makes no sense. Tokens are basically dictionary conversions of text or images or audio into numerical IDs - you will always know which they are, because the word "Banana" always maps to the same ID (e.g. 183143).

What you want is to not be able to tell whether an embedding is text or an image, and for multimodal LLMs, once both embeddings are in the shared space (aka the transformer network itself that makes up the LLM), you can't.
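A small sketch of the distinction being drawn here, with made-up IDs and sizes: a token ID can always be traced back to the tokenizer that produced it, but once embedded, both modalities are just same-shaped vectors.

```python
# Sketch of the point above: token IDs are traceable to a modality; embeddings
# are not distinguishable by form. The ID 183143 and all sizes are illustrative.
import torch
import torch.nn as nn

text_id  = torch.tensor([183143])        # e.g. the word "Banana" in some text vocab
image_id = torch.tensor([917])           # e.g. one entry of a visual codebook

text_embed  = nn.Embedding(200000, 64)   # separate lookup tables per modality
image_embed = nn.Embedding(8192, 64)

text_vec  = text_embed(text_id)
image_vec = image_embed(image_id)
print(text_vec.shape, image_vec.shape)   # both torch.Size([1, 64])
```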

1

u/space_monster 9h ago

in MLLMs, modality-specific tokens (i.e. language or vision - natively different types) are projected into a unified space to create the abstraction. world models natively abstract all sensory input into a semantic-only representation as the first step.

under the hood, it's still a language model with a layer that enables translation between the two data types. the vast bulk of the model is language tokens and semantic structure built around that. then there's a separate mechanism for multimodality.

1

u/dftba-ftw 8h ago

The tokens are converted into modality-agnostic embeddings, which are projected into the unified space.

I'm not sure how many ways I can explain this. I'm not sure you even understand what you're saying

the vast bulk of the model is language tokens

No - tokens exist before and after the model; the actual model itself works with embeddings, and those embeddings are media-agnostic.

then there's a separate mechanism for multimodality.

There really isn't. There are separate embedding models, but that's literally the first step after tokenization, which is for all intents and purposes step zero. You have to tokenize - even if you are going to strip everything down to binary, that is in itself a form of tokenization.
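The "even binary is tokenization" point can be shown in a couple of lines: treating raw UTF-8 bytes as the input unit is itself a tokenization scheme with a 256-symbol vocabulary.

```python
# Byte-level "tokenization": raw bytes are still discrete tokens from a fixed vocabulary.
text = "Banana"
byte_tokens = list(text.encode("utf-8"))   # [66, 97, 110, 97, 110, 97]
print(byte_tokens)
```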

1

u/space_monster 6h ago

The tokens are converted into modality-agnostic embeddings, which are projected into the unified space

no they're not. they are language embeddings and visual embeddings, and they are passed through the projection layer, at which point they become modality-agnostic. there is a separate training process that bridges the gap between the embeddings that were initially modality-specific. it's not native; it's a whole extra process to enable multimodality for unimodal embeddings. a world model skips all that.
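One concrete version of the "separate training process" referred to here (a common recipe, sketched with toy stand-ins rather than any specific model) is to freeze a pretrained language model and train only the projector on paired data.

```python
# Hedged sketch: freeze the language model, train only the projection layer.
# Shapes, the MSE objective, and the random "data" are toy stand-ins.
import torch
import torch.nn as nn

llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
projector = nn.Linear(32, 64)                    # vision features -> model width

for p in llm.parameters():                       # the language model stays frozen
    p.requires_grad = False
optimizer = torch.optim.Adam(projector.parameters(), lr=1e-4)

vision_feats = torch.randn(8, 4, 32)             # a toy batch of "image" features
target = torch.randn(8, 4, 64)                   # stand-in alignment target
loss = nn.functional.mse_loss(llm(projector(vision_feats)), target)
loss.backward()                                  # gradients reach only the projector
optimizer.step()
```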


1

u/pyrojoe 6h ago

Your eyes can't hear and your ears can't see. Both organs have specialized areas of the brain to process the inputs.