AGI Research Landscape 2023–Present
The quest for Artificial General Intelligence (AGI) has coalesced around several interconnected subfields. Foundation models (large language and multimodal models) provide broad knowledge; agentic systems embed models in interactive, goal-driven loops; world models aim to give AI an internal simulation of reality; reasoning and planning seek systematic, multi-step cognition; memory and continual learning target lifelong adaptation; and alignment & safety focus on guiding these systems to behave as intended. These areas overlap heavily – for example, agents may use foundation models and world models together, and memory architectures underpin reasoning and learning – forming a complex research map of AGI pathways.
- Foundation models (LLMs, multimodal AI): scale-driven neural networks trained on vast data (e.g. GPT-4, Meta’s LLaMA, Google DeepMind’s Gemini) that excel at language and increasingly vision, code, and speech, enabling breakthroughs such as GPT-4’s multi-turn reasoning, DALL·E’s image synthesis, and Meta’s open LLaMA models.
- Agentic systems: frameworks that connect LLMs to tools, memory, and environments (e.g. LangChain, AutoGPT, OpenAI’s plugins) so AI can act autonomously.
- World models: explicit internal models of physics and environment (e.g. latent-space simulators from model-based RL like Dreamer, or Gato’s implicit representation) that let AI predict and plan without acting in the real world.
- Reasoning and planning: techniques that enable stepwise logic (chain-of-thought prompting, search algorithms, program synthesis) for tasks requiring planning or inference.
- Memory/continual learning: architectures (memory networks, retrieval systems, adaptive parameters) that let models retain knowledge over time and learn online.
- Alignment and safety: methods (RL from human feedback, constitutional AI, interpretability, verification, adversarial training) for keeping AGI systems beneficial and controllable.
These subfields form a layered architecture: foundation models supply general knowledge, world models ground that knowledge in dynamics, agentic control loops combine them to act, and memory keeps the system growing. Alignment/safety applies across all layers, while reasoning/planning weaves through them to guide decision-making. This landscape is fluid and evolving, with heavy ongoing debate over which approaches will ultimately prevail.
1. Foundation Models (LLMs and Multimodal AI)
Leading approaches: Scaling up large neural models pretrained on massive unlabeled data (text, code, images, video) remains dominant. Notable systems include OpenAI’s GPT series (GPT-4, GPT-5) and its Sora video model, Anthropic’s Claude, Meta’s LLaMA 2/3, and Google DeepMind’s Gemini. These models use transformer-based architectures with billions of parameters, trained by next-token prediction or masked modeling. Multimodal extensions (e.g. OpenAI’s GPT-4o, LLaVA, DeepMind’s Flamingo) incorporate vision, speech or robotics inputs[16]. Training techniques include self-supervision, fine-tuning, and specialized curricula (like adding chain-of-thought prompts) to improve reasoning. Crucially, methods like Reinforcement Learning from Human Feedback (RLHF) are used to align model outputs with human preferences. Experimental evidence shows these LLMs exhibit emergent abilities: few-shot reasoning, code generation, translation, even causal inference in simple settings[3].
Core bottlenecks: Despite impressive fluency, current models face critical limitations. They hallucinate – confidently outputting false or ungrounded information. They lack true world grounding: as LeCun emphasizes, LLMs “don’t really understand the real world” and “can’t really reason or plan” beyond their training text[10]. Models also struggle with long-context or complex multi-step planning, and they inherit biases from their training data. The scaling law that once fueled AGI optimism now shows diminishing returns: OpenAI’s Sam Altman and others predicted relentless gains from scale, but experts (e.g. Ilya Sutskever) argue the “age of scaling” has plateaued[3][9]. Open issues include data curation (running out of new high-quality data), compute limits, and lack of modular, hierarchical structure. Models also remain opaque – even their chain-of-thought is often unfaithful[7], undermining trust.
Why unsolved: These problems persist because scaling alone cannot endow genuine understanding or generalization. Neural nets excel at interpolation in training distributions, but AGI demands extrapolation to truly novel contexts[13]. The “glass ceiling” effect (lack of new data and saturating gains[9]) suggests more compute will not magically solve deep conceptual gaps. The learning objective (next-token) is fundamentally statistical and does not guarantee causal, compositional reasoning. Moreover, interpretability research (to understand internals) is in its infancy, so we lack insight into exactly how models process knowledge[7].
Research hypotheses:
- Hypothesis 1 (Theoretical): Embedding explicit symbolic reasoning modules or knowledge graphs within LLM architectures could enable more reliable logic and fact-checking. For instance, integrating a symbol-manipulator (or calling a math solver/tool) when needed may improve correctness.
- Hypothesis 2 (Engineering): Combining LLMs with trainable neurosymbolic world models (e.g. differentiable simulators) will let the system self-debug outputs by checking consistency with physical laws or facts[2][10].
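The routing idea behind Hypothesis 1 can be sketched in a few lines. The example below is an illustrative toy, not any lab's actual architecture: arithmetic sub-queries are dispatched to an exact AST-based evaluator instead of the statistical model, and everything else falls back to a stub `llm` callable (all names here are invented for illustration).

```python
import ast
import operator

# Supported binary operators for the exact arithmetic path.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a pure-arithmetic expression via an AST walk (no eval())."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def answer(query: str, llm=lambda q: "LLM answer") -> str:
    """Route arithmetic to the exact solver; everything else to the LLM stub."""
    try:
        return str(safe_eval(query))  # exact, verifiable path
    except (ValueError, SyntaxError):
        return llm(query)             # fall back to the statistical model
```

The design point is that the symbolic path is checkable: when the router fires, correctness no longer depends on token statistics.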
Evidence vs speculation vs heuristics:
- Empirical: Current LLMs perform extremely well on broad language tasks (e.g. GPT-4 exhibits human-level performance on many benchmarks[3]). Multimodal models now understand images as well as text.
- Theoretical: Many researchers (LeCun, Marcus) argue that true AGI requires additional ingredients like world modeling, memory, or new learning paradigms[10][11].
- Heuristics: In practice, developers use tricks (chain-of-thought prompting, retrieval-augmented generation) to compensate for shortcomings. For example, retrieval (RAG) adds a memory-like component, and frequent “safety fine-tuning” avoids some failure modes.
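The retrieval heuristic mentioned above reduces to a small loop: score stored documents against the query, then prepend the best matches to the prompt. This sketch substitutes bag-of-words cosine similarity for a real embedding model; the documents and prompt template are invented for illustration.

```python
import math
from collections import Counter

def _vec(text: str) -> Counter:
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    qv = _vec(query)
    return sorted(docs, key=lambda d: _cosine(qv, _vec(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Prepend retrieved context to the question, RAG-style."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

In production systems the `_vec`/`_cosine` pair would be replaced by a dense embedding model and a vector index, but the control flow is the same.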
Divergent views: Proponents of pure scale (e.g. early OpenAI) saw larger LLMs as the path to AGI, while critics (LeCun, Gary Marcus) counter that scaling must be supplemented by new architectures. LeCun sharply states LLMs “are not a road towards what people call AGI”[10], whereas OpenAI’s leadership has at times emphasized “Sparks of AGI” in GPT-4[3]. Some private labs (Anthropic) follow OpenAI’s RLHF-heavy alignment, while others (Meta) favor open models and research on self-supervised learning. These disagreements reflect a split between statistical (scale-based) and cognitive (symbolic/world-model) philosophies of AGI.
2. Agentic Systems
Leading approaches: Agentic AI builds on foundation models by endowing them with autonomy and interactivity. Prominent methods include tool augmentation (LLMs connected to external APIs or simulators), multi-agent coordination, and active planning loops. Frameworks like LangChain (community tool) and Auto-GPT orchestrate LLMs to sequentially call tools (databases, calculators) to solve user requests. Recent architectures introduce self-reflection or planner-executor splits: one LLM generates plans, another executes them with safeguards[5]. In robotics and games, LLM controllers are paired with vision/motor modules (e.g. Google’s RT-1 robot controller, or agents in ALFWorld/Voyager). Multi-agent LLM surveys find that running several specialized LLM agents in conversation (one as “expert”, one as “critic”, etc.) yields more robust solutions than one model alone[4][5].
Core bottlenecks: The biggest challenges are hallucination propagation and coordination complexity. When LLMs act autonomously, their tendency to make up facts can lead agents astray or amplify errors across steps. As Anthropic shows, even chain-of-thought isn’t fully trustworthy[7], so agents relying on internal “reasoning steps” may hide mistakes. Agent frameworks also struggle with multi-modal contexts (the agent may fail to interpret complex images or dynamic environments). Scaling to many agents introduces orchestration challenges: how to control who speaks or which agent takes action at each turn[5]. Evaluation and benchmarking of agents is still primitive – we lack standard tests for “agentic intelligence” beyond ad-hoc tasks[5].
Why unsolved: Creating reliable agents requires solving unsolved subproblems: faithful reasoning, long-term planning, and safe tool use. It also demands the system maintain state (memory) across interactions, which current LLMs cannot do internally. Ensuring emergent behaviors remain aligned is tricky without full transparency. Engineering limitations (finite compute for real-time loops) and the stability-plasticity dilemma (adapting on the fly without forgetting basics) make continuous learning hard[15]. Because agentic behavior combines perception, action and goals, it spans most of AI and inherits all unresolved issues of those fields.
Research hypotheses:
- Hypothesis 1 (Theoretical): A formal multi-agent learning framework (similar to MARL but with LLMs) can yield provably robust coordination. For example, self-play or adversarial training of “debater” and “judge” agents (as in Anthropic’s debate research) may scale to arbitrarily complex tasks[8].
- Hypothesis 2 (Engineering): Designing an embodied memory module (a world-interactive database) for LLM agents will enable continual adaptation. Agents could archive experiences (with success scores) and query them dynamically, akin to MemRL’s value-based retrieval[15].
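Hypothesis 2's archive-and-query loop can be sketched as follows. This is a hypothetical, MemRL-inspired toy, not MemRL's actual implementation: experiences are stored with a success score, and recall ranks entries by token overlap multiplied by that score, so strategies that worked are recalled preferentially.

```python
from collections import Counter

class EpisodicMemory:
    """Toy value-weighted episodic memory for an LLM agent (illustrative)."""

    def __init__(self):
        self.entries = []  # (token counts, raw text, success score)

    def archive(self, text: str, score: float) -> None:
        """Store an experience together with how well it worked."""
        self.entries.append((Counter(text.lower().split()), text, score))

    def recall(self, query: str, k: int = 1) -> list[str]:
        """Rank stored experiences by (token overlap) x (success score)."""
        q = Counter(query.lower().split())
        def value(entry):
            tokens, _, score = entry
            overlap = sum((q & tokens).values())  # shared token count
            return overlap * score                # similarity x utility
        ranked = sorted(self.entries, key=value, reverse=True)
        return [text for _, text, _ in ranked[:k]]
```

A real system would use learned embeddings and an RL-trained value estimate rather than token overlap and hand-set scores, but the retrieval objective is the same.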
Evidence vs speculation vs heuristics:
- Empirical: Survey analyses find that multi-agent LLM systems can solve complex problems (code generation, planning tasks) better than single LLMs[4][15]. Real-world trials (e.g. LLMs driving rudimentary simulations) show promise but frequent failure modes.
- Theoretical: Conceptual frameworks (e.g. [5]) suggest key dimensions (internal planning, tools, environment) needed for agentic intelligence[5]. Some posit that structuring LLMs with roles and checks (inspired by human institutions) can improve safety[6].
- Heuristics: Engineers chain LLM calls (prompt templates like “plan → act → reflect”) and add retrieval buffers for memory. Tool-augmented agents use fixed toolsets (e.g. a calculator, a search engine) chosen by hand.
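The "plan → act → reflect" chaining described above is, structurally, a short loop. The sketch below keeps only the control flow; the `plan_fn`/`reflect_fn` stubs stand in for LLM calls and are invented for this toy arithmetic goal, not taken from any real framework.

```python
def run_agent(goal: str, tools: dict, plan_fn, reflect_fn, max_steps: int = 5):
    """Generic plan -> act -> reflect loop with a step budget."""
    history = []
    for _ in range(max_steps):
        step = plan_fn(goal, history)        # "LLM" proposes the next tool call
        result = tools[step["tool"]](step["arg"])  # act: execute a real tool
        history.append((step["tool"], step["arg"], result))
        if reflect_fn(goal, history):        # "LLM" judges whether goal is met
            return result, history
    return None, history

# Stub behavior for a toy arithmetic goal (placeholders for LLM calls).
def plan_fn(goal, history):
    return {"tool": "calc", "arg": "6*7"}

def reflect_fn(goal, history):
    return history[-1][2] == 42  # done once the calculator returns 42

# Restricted eval as a stand-in calculator tool.
tools = {"calc": lambda expr: eval(expr, {"__builtins__": {}})}
```

The `max_steps` budget is the crude but common safeguard against the runaway loops Auto-GPT-style agents are prone to.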
Divergent views: Major labs differ on the primacy of agency. OpenAI has cautiously experimented with plugin systems, while Anthropic and DeepMind have focused more on safety and world models than on granting chatbots full autonomy. Musk’s xAI boldly bets on agentic AGI (e.g. Grok assistant on X), predicting “superhuman” AI within years[12] – a timeline most researchers deem implausible. Some researchers envision multi-LLM “checks-and-balances” systems for safety, whereas others prioritize monolithic models. Debate continues over where in the stack agency should arise: mainly on top of LLMs with tools, or embedded in new architectures from the ground up.
3. World Models
Leading approaches: Inspired by cognitive science, world models are internal simulators that predict how an environment changes with actions. Latent dynamics networks (e.g. Dreamer) and learned physics engines capture environment state. Systems embed environment pixels into compact representations, then train transition models and reward predictors. Examples include DeepMind’s MuZero and Neural Radiance Fields (NeRFs). Recent research proposes “foundation world models”: large pre-trained networks fine-tuned on specific tasks[2]. Hybrid architectures place a world-model module alongside a language model (LLM supplies broad knowledge, world model supplies grounded prediction)[2]. Embodied AI also leans on simulators (e.g. Habitat for navigation), effectively hard-coded world models.
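The embed-transition-plan recipe above can be illustrated at toy scale: a transition model predicts the next state, and planning means rolling out candidate action sequences "in imagination" and keeping the one whose imagined end-state is closest to the goal. The 1-D linear dynamics here are a hand-set stand-in for a learned network; all coefficients are illustrative assumptions.

```python
def transition(state: float, action: float) -> float:
    """Stand-in for a learned dynamics model (Dreamer-style, but 1-D linear)."""
    return 0.9 * state + 0.5 * action

def rollout(state: float, actions: list[float]) -> float:
    """Simulate an action sequence entirely inside the model (imagination)."""
    for a in actions:
        state = transition(state, a)
    return state

def plan(state: float, goal: float, candidates: list[list[float]]) -> list[float]:
    """Pick the candidate sequence whose imagined end-state is nearest the goal."""
    return min(candidates, key=lambda seq: abs(rollout(state, seq) - goal))
```

This also makes the compounding-error bottleneck concrete: any bias in `transition` is applied once per step, so long rollouts drift unless the model is very accurate.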
Core bottlenecks: Building rich, high-fidelity world models is extremely hard. The real world is vast and continuous; current simulators cover limited domains. Learning a world model from raw data is sample-inefficient and error-prone—slight model errors compound over long horizons. Integrating symbolic knowledge (language) with numerical simulation remains unsolved. Real-world complexity (fluid dynamics, human behavior) exceeds current model capacity. Training such models at scale is costly.
Why unsolved: World models demand grounded perception, common-sense reasoning, and causal understanding beyond pattern recognition. Surveys note that LLMs “lack an explicit notion of environment state” whereas world models “predict environment dynamics through state transitions”[2]. Bridging this gap requires new learning algorithms or architectures that infer latent physical laws from sensory data. Theoretical understanding of how to merge probabilistic world-modeling with neural language understanding is lacking, and robust metrics are scarce.
Research hypotheses:
- Hypothesis 1 (Theoretical): Emergent world models may exist within LLM hidden states and can be extracted via probing. Training tasks requiring multi-step prediction (e.g. predicting video frames from text prompts) might coax LLMs to internalize world dynamics.
- Hypothesis 2 (Engineering): Integrating a hierarchical modular architecture—a fast differentiable physics engine guided by a neural controller—could improve long-horizon planning.
Evidence vs speculation vs heuristics:
- Empirical: Vanilla LLMs perform poorly on physical prediction tasks, while hybrid models (video prediction, Dreamer) master simple environments.
- Theoretical: Work like EGI argues that “language-based reasoning has limitations in capturing high-dimensional, multi-modal environmental dynamics”[2]. LeCun emphasizes learning world models by watching the world go by, analogizing to a cat learning physics[10].
- Heuristic: Practitioners fine-tune on domain-specific simulators or call physics engines as tools.
Divergent views: Some experts (LeCun, “world-modelers”) insist AGI needs explicit environment simulations[10][2]; others hope scaled LLMs implicitly capture enough of the world to generalize. Multi-LLM collaboration proposals include dedicated “world-modeling” agents[6]. Others pursue “digital twins” (high-fidelity engineering models). Debates also cover embodiment: should AGI be trained via robotics or is abstract data sufficient?
4. Reasoning and Planning
Leading approaches: Improving reasoning has two main tracks. One is chain-of-thought prompting: encouraging LLMs to articulate intermediate steps (“Let’s think step by step”), which boosts multi-step problem-solving[7]. Another is program-aided reasoning: prompting LLMs to generate and execute code (Codex, AlphaCode) or to perform structured search (tree-of-thought prompts). Reinforcement learning agents (MuZero, AlphaStar) remain key for sequential control. Emerging work on meta-reasoning (models that critique/repair their own answers) and neurosymbolic hybrids (combining neural nets with symbolic planners) is gaining traction. Program synthesis—generating small programs to solve puzzles—has advanced, with ARC-AGI top solutions now using deep-learning-guided program synthesis[13].
Core bottlenecks: Complex reasoning requires handling combinatorial search and unforeseen scenarios. LLMs plateau on out-of-distribution tasks (ARC-AGI remains unsolved; GPT-3 scored 0%[13]). Models hallucinate logic, producing plausible but incorrect chains. Planning in large state spaces explodes without heuristics. Bridging probabilistic reasoning (LLMs) and symbolic logic remains unsolved.
Why unsolved: General reasoning likely requires manipulating discrete concepts, which neural nets lack by default. General planning needs causal understanding plus search heuristics. Theory for long-horizon foresight in differentiable systems is thin. Benchmarks like ARC-AGI show scaling alone can’t achieve generalization[13]. Detecting reasoning mistakes without grounding is hard.
Research hypotheses:
- Hypothesis 1 (Engineering): Insert a symbolic planner under LLM supervision—translate tasks into planning language, solve symbolically, translate back.
- Hypothesis 2 (Empirical): Curriculum learning on progressively harder logic puzzles plus self-critique loops may unlock new reasoning skills.
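Hypothesis 1's planner half can be made concrete with classical search: once an LLM has translated a task into discrete states and operators, breadth-first search returns a plan that is valid by construction. The two-block stacking domain below is invented purely for illustration.

```python
from collections import deque

def bfs_plan(start, goal, operators):
    """Breadth-first search over frozenset states.

    operators: dict of name -> (precondition_fn, effect_fn)."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:  # every goal fact holds in this state
            return plan
        for name, (pre, eff) in operators.items():
            if pre(state):
                nxt = eff(state)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [name]))
    return None  # goal unreachable

# Toy blocks-world: pick block a up off the table, then stack it on b.
operators = {
    "pickup_a": (lambda s: "a_on_table" in s,
                 lambda s: (s - {"a_on_table"}) | {"holding_a"}),
    "stack_a_on_b": (lambda s: "holding_a" in s,
                     lambda s: (s - {"holding_a"}) | {"a_on_b"}),
}
start = frozenset({"a_on_table", "b_on_table"})
goal = frozenset({"a_on_b"})
```

Unlike a sampled chain of thought, a plan found this way cannot contain an invalid step; the open question is whether the LLM's translation into states and operators is itself faithful.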
Evidence vs speculation vs heuristics: Chain-of-thought prompting yields large gains but not perfection. ARC-AGI competition results show hybrid (neural + symbolic) systems beating pure neural ones[13]. Some argue (e.g. Chollet) that generalization to novel reasoning defines intelligence[13], implying new architectures are needed. Practitioners chain smaller LLM calls (analysis → solution → verification) and embed symbolic checks.
Divergent views: Some industry labs hoped scaling + RLHF would yield near-human reasoning, while skeptics (Marcus, LeCun) call for extra structure. OpenAI’s “Sparks of AGI” paper cited emergent reasoning[3], but Gary Marcus countered that such claims gloss over gaps[11]. Debates continue over whether gradient-based learning can ever capture “thinking,” or if neurosymbolic/cognitive architectures are required.
5. Memory and Continual Learning
Leading approaches: AGI must learn continually. Memory-augmented networks and retrieval systems are primary tools. Retrieval-augmented generation extends context; episodic memory buffers (MemRL) store experiences with utility weights[15][15]. Fine-tuning on new data or using adapters/LoRA modules is common but expensive. Parameter-efficient incremental learning (prefix tuning) adds knowledge with fewer updates. Benchmarks like MemoryBench simulate user feedback to test retention[14]. RL agents rely on experience replay. Research prototypes separate long-term (semantic) and short-term (working) memory modules, akin to hippocampus and cortex[14].
Core bottlenecks: The stability-plasticity dilemma plagues continual learning: new tasks often catastrophically forget old skills[15]. LLMs with frozen weights cannot easily learn post-deployment. Non-parametric methods like RAG avoid forgetting but don’t adapt reasoning. Metrics for “how much an LLM remembers” are immature.
Why unsolved: Lifelong learning requires a model that stays stable on old skills while still absorbing new information. Current ML assumes stationarity or offline retraining. Guaranteeing convergence without interference across infinite tasks remains unsolved. Biological systems use consolidation and replay; neural nets lack analogs. MemoryBench results show SOTA LLM agents are “far from satisfying” at incorporating feedback[14].
Research hypotheses:
- Hypothesis 1 (Empirical): Use a frozen core model plus an external adaptive memory (MemRL) updated via RL, enriching knowledge without changing the core[15].
- Hypothesis 2 (Theoretical): Emulate biological consolidation (sleep-like phases, replay) to add knowledge without catastrophic forgetting.
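The replay idea in Hypothesis 2 can be demonstrated at minimal scale. Here a single scalar "weight" updated by gradient steps on a 1-D target stands in for a network; training on task B alone overwrites what was learned on task A, while interleaving rehearsed task-A samples keeps the estimate near a compromise. The setup is a deliberately tiny illustration, not a proposed method.

```python
def train(weight, samples, lr=0.1):
    """SGD on squared error against a sequence of scalar targets."""
    for target in samples:
        weight += lr * (target - weight)  # gradient step toward the target
    return weight

def train_with_replay(weight, new_samples, replay_buffer, lr=0.1):
    """Interleave new samples with rehearsed old ones (deterministic mixing)."""
    mixed = [x for pair in zip(new_samples, replay_buffer) for x in pair]
    return train(weight, mixed, lr)
```

Without replay the weight tracks the newest task and "forgets" the old one; with replay it settles between the two, the scalar analogue of avoiding catastrophic forgetting.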
Evidence vs speculation vs heuristics: Memory-augmented agents (MemRL) continuously improve and beat static baselines[15]. Fine-tuned LLMs still struggle to learn from conversations without explicit prompts. Cognitive science suggests dual-memory structures; engineers currently rely on retrieval heuristics or manual expert systems.
Divergent views: Broad agreement on importance, but not on implementation. Some push reinforced memory banks[15]; others refine fine-tuning. OpenAI uses RLHF for adaptation but often retrains models (not true continual learning). Academia revisits cognitive-inspired models, while industry often sidesteps with fresh data ingestion.
6. Alignment and Safety
Leading approaches: Ensuring AGI is safe and aligned has spawned many techniques. RLHF teaches models human preferences. Anthropic’s Constitutional AI (Claude’s constitution) encodes explicit values[17]. Adversarial training (red-teaming) and constrained decoding filter dangerous outputs. Other approaches: mechanistic interpretability, scalable oversight (debate, amplification), and formal verification. Benchmark suites (e.g. Dangerous Questions) probe for hazardous outputs. Regulators (FLI, governments) track lab practices.
Core bottlenecks: Alignment is arguably the hardest problem. Models can behave deceptively or unpredictably, and faithful introspection is lacking—Anthropic finds models hide misleading info even when asked to explain[7]. No foolproof specification of safety exists; RLHF optimizes surface outputs without guaranteeing inner alignment. Chain-of-thought guardrails are incomplete. Distributional shift means aligned models may fail in novel deployments. Economic pressure disincentivizes rigorous safety (FLI reports most labs lack robust safety plans[1]).
Why unsolved: AGI-level risks involve deep theoretical issues (value learning, corrigibility). The error space is vast, making exhaustive testing impossible. Human feedback is expensive and can’t cover open-ended tasks. Social and political incentives often conflict with safety. Technically, no theory guarantees a self-improving system stays aligned.
Research hypotheses:
- Hypothesis 1 (Speculative): Embed a self-supervised consequences model (internal critic) that predicts harm and vetoes unsafe plans.
- Hypothesis 2 (Empirical): Develop rigorous stress-test benchmarks (e.g. simulated social dilemmas) to expose alignment gaps; systematic failures imply the system isn’t AGI-safe.
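Hypothesis 1's veto mechanism has a simple skeleton: score each candidate plan for predicted harm and refuse to execute anything above a threshold. The keyword scorer below is a deliberately crude placeholder for a learned consequences model; the terms, weights, and threshold are all invented for illustration.

```python
# Hypothetical harm lexicon; a real critic would be a learned model.
HARM_TERMS = {"delete": 0.6, "exfiltrate": 0.9, "disable_safety": 1.0}

def harm_score(plan_steps: list[str]) -> float:
    """Worst-token harm estimate across all plan steps."""
    return max((HARM_TERMS.get(tok, 0.0)
                for step in plan_steps for tok in step.lower().split()),
               default=0.0)

def execute_if_safe(plan_steps, executor, threshold=0.5):
    """Veto the plan before execution if predicted harm crosses the threshold."""
    if harm_score(plan_steps) >= threshold:
        return "VETOED"
    return executor(plan_steps)
```

The hard part, of course, is the part this sketch fakes: a harm model that generalizes to plans whose risk is not visible at the token level.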
Evidence vs speculation vs heuristics: Current evidence shows partial success. Anthropic’s experiments find chain-of-thought explanations often unfaithful[7]; even after RLHF, jailbreaks persist. FLI’s AI Safety Index finds no lab with a complete safety plan[1]. Pragmatic heuristics include multi-layer filtering, ethics fine-tuning, and staged deployment. OpenAI’s AGI Charter advocates gradual rollout to learn from errors[18].
Divergent views: OpenAI favors incremental testing and human oversight; Anthropic explores scalable debate/constitution approaches. Musk’s xAI criticizes OpenAI for “abandoning mission”, while OpenAI contests Musk’s timelines. Some (Bostrom, FLI) warn of existential risk, prompting regulation; others prioritize rapid deployment. Disagreements persist on AGI timelines and safety urgency.
7. Research Map and Open Problems
We organize AGI research into the subfields above, with cross-cutting overlaps (world models feed agents, memory assists reasoning, safety constrains all). Key open problems (ranked by impact and tractability):
- Alignment/Safety (Impact: ★★★★★, Tractability: ★★☆☆☆). Essential for AGI’s benefits vs existential risk, but very hard. Current tools have fundamental limits[7].
- Robust Generalization / Out-of-Distribution Reasoning (Impact: ★★★★★, Tractability: ★☆☆☆☆). AGI must handle novel tasks (e.g. ARC-AGI[13]). Deep generalization remains elusive.
- Interpretability and Transparency (Impact: ★★★★☆, Tractability: ★★☆☆☆). Understanding model internals is critical; current methods remain rudimentary.
- Continual Learning and Memory (Impact: ★★★★☆, Tractability: ★★★☆☆). Lifelong adaptation is key; methods like MemRL[15] and MemoryBench[14] are promising but early-stage.
- World Modeling (Impact: ★★★★☆, Tractability: ★★☆☆☆). Grounded models of reality are crucial. Simulators exist, but general models remain out of reach[2].
- Multi-Agent Coordination (Impact: ★★★☆☆, Tractability: ★★★☆☆). AGI may need to orchestrate sub-agents. Research on LLM multi-agent systems[4] is growing, but scalability and evaluation remain open[5].
- Causality and Common Sense (Impact: ★★★☆☆, Tractability: ★★☆☆☆). Understanding cause–effect (beyond correlations) is essential for real intelligence.
8. Benchmarks and Experiments
To drive progress, we need concrete tests. Promising designs and benchmarks include:
- ARC-AGI (Abstraction and Reasoning Corpus): Evaluates generalization to novel tasks designed by humans[13]. GPT-3 scored 0%[13], highlighting gaps. Expand ARC with new domains and measure minimal-example solving.
- Memory/Continual Learning Benchmarks: MemoryBench simulates user feedback over time[14]. Design tasks where agents must recall preferences after sessions.
- Planning and Reasoning Tests: Raven’s Progressive Matrices, arithmetic/programming problems, and multi-step quests (BigCodeBench, ALFWorld) stress reasoning. Extensions include tree-of-thought exams that require justification, plus adversarial puzzles.
- Embodied Tasks: Virtual environments (robotics simulators, Minecraft, Gym) to test long-horizon planning.
- Safety/Red-Teaming Suites: Automated stress tests probing disallowed behavior (e.g. Anthropic’s hint experiments[7]).
Such experiments should be standardized (open datasets, leaderboards) and diversified across the AGI skill spectrum. Critically, measure both capability and robustness.
9. Indicators and Falsification Criteria for AGI Claims
Given hype cycles (GPT-4 “sparks” and GPT-5 disappointments[3]), we propose criteria for evaluating AGI claims:
- Benchmark Performance: AGI should exceed human baselines across unseen tasks (language, vision, planning, motor control). For example, >95% on ARC-AGI[13] or superhuman general intelligence tests. Failures falsify AGI claims.
- Continual and Autonomous Learning: True AGI should learn from novel interactions without task-specific retraining. Systems needing human fine-tuning for every scenario are narrow. Test whether the system adapts online (MemRL-style experiments[15]). Lack of improvement without retraining challenges AGI assertions.
- Alignment and Understanding: AGI claims imply safety by design. If a model hides critical reasoning (Anthropic found unfaithful chains[7]) or fails edge-case value alignment, it isn’t truly beneficial. Report transparency metrics (hallucination rates, chain-of-thought faithfulness).
- Independent Verification: Like science, AGI claims should be reproducible by third parties (e.g. FLI’s call for formal evaluations[1]). Define an “AGI Turing Test”: the system autonomously solves a broad problem specified at runtime, without human decomposition. Failure falsifies AGI.
In practice, labs claiming breakthroughs should publish comprehensive results on agreed benchmarks (ARC, memory tasks, integrated agents). Absence of evidence—or adversarial demo failures—should refute strong AGI claims. Track generalization gaps (novel vs trained domains) and safety lapses (harmful output rate).
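The generalization-gap tracking proposed above can be stated as an explicit check. The function below is a sketch of such a falsification rule; the human-baseline and gap thresholds are illustrative placeholders, not agreed community standards.

```python
def generalization_gap(trained_acc: float, novel_acc: float) -> float:
    """Accuracy drop from trained domains to genuinely novel ones."""
    return trained_acc - novel_acc

def agi_claim_survives(trained_acc: float, novel_acc: float,
                       human_baseline: float = 0.95,
                       max_gap: float = 0.10) -> bool:
    """A claim survives only if novel-domain performance clears the human
    baseline AND the train-to-novel gap stays small."""
    return (novel_acc >= human_baseline
            and generalization_gap(trained_acc, novel_acc) <= max_gap)
```

Encoding the criterion as a function makes the falsification logic auditable: any lab's published numbers either pass it or they do not.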
10. Conclusion
AGI research today is rich and multidisciplinary, yet unresolved. Empirical evidence from GPT-4, Claude, PaLM and others shows remarkable capability, but each AGI ingredient (grounding, planning, memory, alignment) remains weak. Theoretical insights (cognitive science, complexity) warn that current trends may stall without new ideas. Engineering heuristics (tool use, RLHF, retrieval) drive incremental progress but often patch rather than solve. Disagreements between scalers and structure-seekers, and between hype and skepticism, underscore the uncertainty.
Progress over the next 3–10 years requires a pluralistic approach: keep improving foundation models and develop complementary modules (world simulators, memory systems, verification layers). Pursue ambitious experiments (embodied agents, continual learning trials) and stringent benchmarks (ARC-AGI, MemoryBench) to quantify generalization. Measure interpretability and robustness, not just performance. Only by combining evidence-driven engineering with deeper theory (computational neuroscience, symbolic logic, embodied cognition) can we solve the core bottlenecks.
References:
- [1] 2025 AI Safety Index – Future of Life Institute — https://futureoflife.org/ai-safety-index-summer-2025/
- [2] Edge General Intelligence Through World Models and Agentic AI: Fundamentals, Solutions, and Challenges — https://arxiv.org/pdf/2508.09561
- [3] Unlocking the Wisdom of Large Language Models: An Introduction to The Path to Artificial General Intelligence — https://arxiv.org/abs/2409.01007
- [4] LLM Multi-Agent Systems Survey (arXiv:2402.01680) — https://arxiv.org/pdf/2402.01680
- [5] Agents Orchestration for LLM Multi-Agent Systems (arXiv:2402.01680) — https://arxiv.org/pdf/2402.01680
- [6] Planning for AGI and Beyond – OpenAI — https://openai.com/index/planning-for-agi-and-beyond/
- [7] Reasoning models don't always say what they think – Anthropic — https://www.anthropic.com/research/reasoning-models-dont-say-think
- [8] Anthropic Fall 2023 Debate Progress Update — https://www.alignmentforum.org/posts/QtqysYdJRenWFeWc4/anthropic-fall-2023-debate-progress-update
- [9] What if A.I. Doesn’t Get Much Better Than This? – The New Yorker — https://www.newyorker.com/culture/open-questions/what-if-ai-doesnt-get-much-better-than-this
- [10] Meta’s AI Chief Yann LeCun on AGI, Open-Source, and AI Risk – TIME — https://time.com/6694432/yann-lecun-meta-ai-interview/
- [11] “Scale Is All You Need” is dead – Gary Marcus — https://garymarcus.substack.com/p/breaking-news-scale-is-all-you-need
- [12] Tesla's Musk predicts AI will be smarter than the smartest human next year – Reuters — https://www.reuters.com/technology/teslas-musk-predicts-ai-will-be-smarter-than-smartest-human-next-year-2024-04-08/
- [13] ARC Prize 2024: Technical Report — https://arxiv.org/html/2412.04604v1
- [14] MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems — https://arxiv.org/html/2510.17281v1
- [15] MemRL outperforms RAG on complex agent benchmarks without fine-tuning – VentureBeat — https://venturebeat.com/technology/memrl-outperforms-rag-on-complex-agent-benchmarks-without-fine-tuning
- [16] A Survey on Agentic Multimodal Large Language Models — https://arxiv.org/abs/2510.10991
- [17] Claude's Constitution – Anthropic — https://www.anthropic.com/constitution
- [18] OpenAI Charter — https://openai.com/charter