Cognitive AI Universe – Episode 2


From Model Limitations to Future Potentials

Hello everyone, and welcome back to the second episode of Cognitive AI Universe.
In the previous session, we explored the technical foundation at the very bottom of the AI stack: data, models, training, and quantization mechanisms. Together, these components built the first real bridge that allowed machines to perceive aspects of our human world.

But as ordinary users, what we actually interact with is not the underlying systems but the diverse range of AI products built on top of them. At first glance, these products often bring us a sense of wonder. Yet that excitement is quickly followed by a realization: models have many inherent limitations.

One researcher once summarized these into four categories. Due to time constraints, today we’ll focus on the first three. And while many of these issues have already been alleviated through technical progress, I personally believe that what’s more valuable than simply listing “limitations” is exploring the new mechanisms that emerged to overcome them—because these mechanisms point toward future possibilities.

So, let’s dive in.


1. The Limitation of Data

The first limitation can be summarized as data constraints, which can be broken down into two dimensions:

  • Timeliness of data

  • Coverage (or breadth) of data

Training happens in discrete cycles, much like firing porcelain: once a batch is in the kiln, nothing new can be added. This means a model is always trained on data up to a certain cutoff date, and anything that happens afterward is simply unknown to it.

For example, I once asked GPT when Zhao Xintong won his world championship title. Because the model's data cutoff fell before his victory, it had no idea and gave me an incorrect answer. That's a classic timeliness limitation.

Another personal example: I’ve been playing a very niche game for years. Out of curiosity, I asked GPT for strategies. Unsurprisingly, because the game never made it into the training data, the model produced nonsense. That’s the coverage limitation.

So, how do we overcome this?
Enter RAG (Retrieval-Augmented Generation). Its entire purpose is to supplement models with real-time and external data.

When I repeated the Zhao Xintong question using a RAG-enabled GPT, the model instantly answered correctly: “He won the championship this May at the age of 28.”

Beyond timeliness, RAG also empowers models with personalized knowledge bases. For example, I’ve been working on a project where I structure my past decade of video scripts into a private database. By connecting this with a large model via RAG, I can instantly query my own archive—making the model more personally valuable to me.

The underlying logic of RAG has two core traits:

  1. Outside-in supplementation – pulling in external data dynamically

  2. Real-time handling of unstructured information – and feeding it back into live generation
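
To make those two traits concrete, here is a minimal sketch of a retrieval loop in Python. It assumes the official openai client library, uses a plain in-memory list where a real system would use a vector database, and the model names are just placeholders; treat it as an illustration of the flow, not production code.

```python
# Minimal RAG sketch (illustrative only): embed documents, retrieve the most
# relevant ones for a question, and feed them to the model as extra context.
# Assumes the `openai` Python client; model names are placeholders, and the
# "document store" is just an in-memory list.
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

documents = [
    "Zhao Xintong won the snooker world championship in May.",
    "The hotel pool is open from 6 a.m. to 10 p.m.",
]

def embed(texts):
    # Turn each text into a vector so similarity can be measured numerically.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vectors = embed(documents)

def answer(question, top_k=1):
    q_vec = embed([question])[0]
    # Cosine similarity between the question and every stored document.
    sims = doc_vectors @ q_vec / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec)
    )
    context = "\n".join(documents[i] for i in np.argsort(sims)[::-1][:top_k])
    # "Outside-in supplementation": the retrieved text is injected into the prompt.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content

print(answer("When did Zhao Xintong win his world title?"))
```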

This means the user experience improves dramatically. Imagine asking your AI assistant to remind you to pick someone up at the airport. In the past, it would simply set a reminder. But with RAG, it could notice that a key road is temporarily closed and suggest you leave earlier—providing context-aware, situationally valuable advice.

Industries such as customer service are already applying this at scale. For instance, if a customer inquires about a rare product, an RAG-enabled system can pull the exact manual from the company’s knowledge base and generate a user-friendly response on the spot.

In short: RAG solves both timeliness and coverage issues by extending the model’s “knowledge horizon.”


2. The Limitation of Memory

Now, let’s imagine interacting with a model that has no memory at all.

  • You ask: “Who founded the Ming Dynasty?” → “Zhu Yuanzhang.”

  • Then you ask: “What was his original name?” → The model is lost.

Why? Because without memory, the model doesn’t know “he” refers to Zhu Yuanzhang. To fix this, you’d need to repeat the full question: “What was Zhu Yuanzhang’s original name?”

This shows why context windows (short-term memory) are so important. If the model can “remember” just 20 tokens, suddenly it connects the dots between questions. Modern models now have million-token windows—equivalent to hundreds of thousands of words. This enables far deeper, more coherent interactions.
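
Here is a small illustration of that point, assuming the openai Python client (the model name is a placeholder). The only difference between the two calls is whether the earlier exchange travels along with the new question.

```python
# Sketch of "short-term memory" as the context window: the follow-up question
# only makes sense if the earlier exchange is passed back in alongside it.
# Assumes the `openai` Python client; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()
history = [
    {"role": "user", "content": "Who founded the Ming Dynasty?"},
    {"role": "assistant", "content": "Zhu Yuanzhang."},
]

# Without history, the model has no idea who "his" refers to...
no_memory = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What was his original name?"}],
)

# ...with the prior turns included, the pronoun resolves naturally.
with_memory = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=history + [{"role": "user", "content": "What was his original name?"}],
)

print(no_memory.choices[0].message.content)    # likely asks who you mean
print(with_memory.choices[0].message.content)  # can answer about Zhu Yuanzhang
```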

But is infinite memory possible?
Theoretically yes, but not by infinitely extending context windows. A more efficient way is to treat the short-term window as working memory, and then offload older interactions to external databases—retrievable later through RAG. This effectively simulates long-term or even “permanent” memory.
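
Below is a toy sketch of that working-memory-plus-archive idea. The embed() function and the plain Python lists are stand-ins for a real embedding model and a real vector database; what matters is the flow: turns that fall out of the window are archived rather than lost, and similar ones are recalled when a new question needs them.

```python
# Toy sketch of "working memory + external archive": keep only the last few
# turns in the prompt, push older turns into an archive, and pull them back
# when a new question looks similar. embed() is a stand-in for a real model.
import numpy as np

WINDOW = 4            # number of turns kept in the short-term context window
working_memory = []   # turns currently in the prompt (short-term memory)
archive = []          # (vector, text) pairs acting as "long-term memory"

def embed(text):
    # Placeholder: a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

def remember(turn):
    working_memory.append(turn)
    if len(working_memory) > WINDOW:
        old = working_memory.pop(0)        # falls out of the window...
        archive.append((embed(old), old))  # ...but is archived, not lost

def recall(query, top_k=2):
    # RAG-style retrieval over the archive: the most similar old turns come back.
    if not archive:
        return []
    q = embed(query)
    scores = [float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
              for v, _ in archive]
    best = np.argsort(scores)[::-1][:top_k]
    return [archive[i][1] for i in best]

def build_prompt(query):
    # The prompt mixes recalled long-term memories with the live window.
    return "\n".join(recall(query) + working_memory + [query])

for turn in ["Who founded the Ming Dynasty?", "Zhu Yuanzhang.",
             "What was his original name?", "Zhu Chongba.",
             "Is the hotel pool open?", "Yes, until 10 p.m."]:
    remember(turn)
print(build_prompt("Tell me more about the Ming founder."))
```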

Yet, memory isn’t the core issue. The deeper point is contextual association.
If text can be linked through vectorized context, then so can other data types:

  • Behavioral data

  • Geolocation data

  • Physiological indicators

  • Environmental signals (altitude, air pressure, etc.)

The richer the data types, the finer the granularity, and the stronger AI’s ability to “simulate memory” across life contexts.
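
As a toy illustration of what associating across data types could look like, the sketch below packs a few invented signals (hour of day, whether you are commuting, heart rate) into one normalized vector per moment, so that similar past situations can be recalled with a nearest-neighbor search. Every number and field name here is made up for the example.

```python
# Toy sketch of cross-signal "contextual association": each moment of the day
# becomes a feature vector mixing behavior, location, and physiology, so
# similar past situations can be recalled by nearest-neighbor search.
import numpy as np

past_moments = [
    {"desc": "left home late, coffee pre-ordered on the way", "features": [8.7, 1, 72]},
    {"desc": "worked past midnight, calming music chosen",    "features": [23.5, 0, 64]},
    {"desc": "morning run in the park",                       "features": [6.5, 1, 120]},
]

raw = np.array([m["features"] for m in past_moments], dtype=float)
mean, std = raw.mean(axis=0), raw.std(axis=0) + 1e-9
normed = (raw - mean) / std  # put hour, location flag, and heart rate on one scale

def closest_context(features):
    # Nearest neighbor over the normalized multi-signal vectors.
    v = (np.array(features, dtype=float) - mean) / std
    dists = np.linalg.norm(normed - v, axis=1)
    return past_moments[int(np.argmin(dists))]["desc"]

print(closest_context([8.9, 1, 75]))  # recalls the "left home late" situation
```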

Think about it: most of our human memories aren’t chat logs—they’re sounds, feelings, even smells. If AI could associate across these dimensions, its “memory” would transcend human-like recall and become predictive. For example:

  • If AI notices you’re late leaving home, it could automatically order your usual coffee so it’s ready at the office.

  • If it sees you working late, it might choose calming music when you ask it to play something.

And what’s the most comprehensive sensor of our lives today? The smartphone.
Combine AI with the phone’s multidimensional data streams, and the potential of contextual intelligence expands almost without limit.


3. The Limitation of Perception

Finally, let’s talk about perception limits.

Large language models are trained on text, while vision models are trained on images. Separately, they work—but they don’t naturally “share a brain.” This means one model might see an image but can’t describe it, while another can speak fluently but can’t see.

The breakthrough comes from vector alignment across modalities.
When text embeddings and image embeddings are precisely aligned in the same vector space, AI can both see and describe the world, like putting eyes and a mouth in the same head. This is the foundation of multimodal vision-language models (VLMs).
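
A quick way to see this alignment in action is a CLIP-style model, in which images and text share one embedding space. The sketch below assumes the Hugging Face transformers library and the publicly released openai/clip-vit-base-patch32 checkpoint; with a real photo in place of the blank placeholder image, the scores show which caption the model thinks matches best.

```python
# Sketch of cross-modal alignment with a CLIP-style model: the image and the
# candidate captions are embedded into the same vector space, and similarity
# in that space tells us which caption fits the picture.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), "white")  # a real photo would go here
captions = ["a robot waiting by an elevator", "a man facing left", "a cup of coffee"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity in the shared embedding space;
# softmax turns the scores into "which caption fits best" probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.2f}  {caption}")
```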

If the alignment is poor, we get errors: the vision module sees “Tony Leung’s melancholic eyes,” but the language module outputs “a man facing left.” That’s the danger of weak cross-modal integration.

When alignment works, AI gains true environmental perception. Compare two examples:

  • Low perception (script logic): A delivery robot in a hotel says, “Please step aside, I’m exiting the elevator,” even when no one is there. It’s just following pre-coded triggers.

  • High perception (multimodal): With vision + language fused, the robot knows when someone is present, greets them by name, and adapts its behavior dynamically.
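
The contrast is easy to caricature in code. The sketch below is purely hypothetical: detect_person() stands in for a call to a multimodal model watching the camera feed, and nothing here is taken from an actual delivery robot.

```python
# A purely hypothetical contrast between "script logic" and perception-driven
# behavior. detect_person() is a stand-in for a multimodal model looking at a
# camera frame; none of this comes from a real robot's codebase.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Person:
    name: str

def detect_person(camera_frame) -> Optional[Person]:
    """Placeholder: a real system would run a vision-language model here."""
    return None  # pretend the corridor is empty right now

def exit_elevator_scripted():
    # Low perception: the announcement fires on a fixed trigger, no matter what.
    print("Please step aside, I'm exiting the elevator.")

def exit_elevator_perceptive(camera_frame):
    # High perception: check who is actually there and adapt the behavior.
    person = detect_person(camera_frame)
    if person is None:
        print("(exits quietly)")
    else:
        print(f"Good evening, {person.name}. Excuse me, I'm stepping out.")

exit_elevator_scripted()
exit_elevator_perceptive(camera_frame=None)
```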

The same applies to autonomous driving. Without multimodal perception, many systems today are still glorified “script triggers”—initiating lane changes when you signal, but relying on the human driver for actual environmental judgment. True autonomy requires fusing multiple modalities (vision, LiDAR, sensors) into a unified perception and reasoning system.


Closing Thoughts

Today, we looked at three fundamental limitations of early AI models:

  1. Data limitations → solved by RAG

  2. Memory limitations → addressed by context windows + external retrieval

  3. Perception limitations → addressed by multimodal vector alignment

Each limitation gave birth to mechanisms that didn’t just fix problems, but opened new frontiers.

In the next episode, we’ll continue this journey by discussing function calling and MCP (Model Context Protocol)—and how they extend AI’s agency beyond perception into real-world action.

Stay tuned—the Cognitive AI Universe is just beginning to expand.