Latest 15 Papers - June 28, 2025

Please check the GitHub page for a better reading experience and more papers.

Embodied AI: Advancements and Research Papers

In the dynamic field of Embodied AI, the latest research papers reveal significant progress in spatial-temporal world understanding, continual learning, and vision-language navigation. This section delves into these advancements, providing a comprehensive overview of the cutting-edge research shaping the future of AI agents that can interact with and learn from their environments. Understanding these developments is crucial for researchers and practitioners alike, as Embodied AI is poised to revolutionize robotics, autonomous systems, and human-computer interaction.

Spatial-Temporal World Understanding and MLLMs

One notable paper, STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?, explores the capabilities of Multimodal Large Language Models (MLLMs) in understanding the spatial and temporal dynamics of the world. The ability to precisely interpret and reason about spatial-temporal information is fundamental for Embodied AI agents operating in complex environments. This research likely introduces a benchmark, STI-Bench, designed to evaluate MLLMs on their spatial-temporal reasoning abilities, highlighting both the current strengths and limitations of these models. Such evaluations are crucial for guiding future research directions and developing more robust and reliable AI systems. Understanding the intricacies of how MLLMs perceive and process spatial-temporal data can lead to significant improvements in applications such as autonomous navigation, robotic manipulation, and interactive environments.

Continual Learning with Gaussian Splatting

The paper CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization, accepted at ICCV 2025, presents an innovative approach to continual learning using Gaussian Splatting. Continual learning, or lifelong learning, is a critical capability for Embodied AI agents that need to adapt to changing environments and tasks over time. The technique of Gaussian Splatting, combined with local optimization, allows the system to efficiently update its representation of the environment without catastrophically forgetting previously learned information. This research is particularly relevant for applications where AI agents must operate in dynamic settings, such as in robotics and augmented reality. The project page, https://cl-splats.github.io, likely provides additional details and resources, offering valuable insights into the practical implementation and performance of this method.
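To make the "local optimization" idea concrete, here is a minimal sketch (not the paper's implementation) of the general principle: when part of a scene changes, only the Gaussians inside the affected region are updated, so everything outside it is preserved exactly. The scene representation, region test, and update rule below are all toy stand-ins.

```python
import numpy as np

# Toy "scene" of N Gaussians, each with a 3D center and an opacity.
rng = np.random.default_rng(0)
positions = rng.uniform(0, 10, size=(100, 3))   # Gaussian centers
opacities = rng.uniform(0.1, 1.0, size=100)

# Suppose the region x < 2.0 changed; mask the affected Gaussians.
affected = positions[:, 0] < 2.0

# "Optimize" only the affected subset (here: a dummy step toward a target).
target_opacity = 0.5
opacities_new = opacities.copy()
opacities_new[affected] += 0.1 * (target_opacity - opacities[affected])

# Unaffected Gaussians are bit-identical to before -- nothing is forgotten.
assert np.array_equal(opacities_new[~affected], opacities[~affected])
```

The point of the sketch is the masking: because the update never touches Gaussians outside the changed region, previously learned geometry cannot be catastrophically overwritten.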

Vision-Language Navigation and Reinforcement Fine-Tuning

VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning introduces a method for Vision-Language Navigation (VLN) that utilizes reinforcement fine-tuning. VLN is a challenging task in Embodied AI where an agent must navigate through an environment based on natural language instructions and visual input. Reinforcement learning offers a powerful framework for training agents to make optimal decisions in complex environments, and fine-tuning pre-trained models can further enhance performance. The project page, vlnr1.github.io, likely showcases the results and methodology of this research, providing a valuable resource for those interested in the intersection of vision, language, and navigation in AI systems. This research has implications for applications such as robotic assistants, autonomous vehicles, and virtual reality environments.

General World Models: A Brief Survey

From 2D to 3D Cognition: A Brief Survey of General World Models offers a survey of general world models, focusing on the transition from 2D to 3D cognitive understanding. World models are crucial for Embodied AI as they allow agents to predict and reason about their environment. This survey likely covers various approaches to building and utilizing world models, emphasizing the importance of 3D perception for more robust and realistic simulations. The transition from 2D to 3D cognition is a significant step towards creating AI systems that can truly understand and interact with the physical world. This survey serves as a valuable resource for researchers looking to gain a comprehensive understanding of the current state-of-the-art in world modeling.

Multi-sensor Fusion Perception for Embodied AI

A Survey of Multi-sensor Fusion Perception for Embodied AI: Background, Methods, Challenges and Prospects provides an overview of multi-sensor fusion perception, a critical aspect of Embodied AI. AI agents often rely on multiple sensors (e.g., cameras, LiDAR, IMUs) to perceive their environment. Multi-sensor fusion techniques combine data from these sensors to create a more complete and accurate representation of the world. This survey likely covers the background, methods, challenges, and future prospects of multi-sensor fusion in Embodied AI, offering a valuable resource for researchers and practitioners working on building robust and reliable AI systems.

Intelligent Science Laboratory and the Integration of AI

Position: Intelligent Science Laboratory Requires the Integration of Cognitive and Embodied AI presents a position paper advocating for the integration of cognitive and Embodied AI in Intelligent Science Laboratories. This paper likely argues that combining cognitive AI, which focuses on high-level reasoning and decision-making, with Embodied AI, which emphasizes physical interaction with the environment, is essential for advancing scientific discovery. Such integration could lead to the development of AI systems that can autonomously design and conduct experiments, analyze data, and generate new hypotheses, thereby accelerating the pace of scientific research.

Temporally Consistent Relighting for Dynamic Long Videos

TC-Light: Temporally Consistent Relighting for Dynamic Long Videos introduces a method for temporally consistent relighting of dynamic long videos. Relighting, the process of changing the illumination in a scene, is a crucial technique for creating realistic and visually appealing content. Ensuring temporal consistency is particularly important in videos to avoid flickering and other artifacts. This research, with its project page at https://dekuliutesla.github.io/tclight/ and code at https://github.com/Linketic/TC-Light, likely presents a novel approach to relighting that maintains visual coherence over time, enhancing the realism of dynamic video content. This has implications for virtual production, augmented reality, and video editing applications.

Spatial Reasoning in Vision-Language Models: A Comprehensive Dataset

InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models introduces InternSpatial, a comprehensive dataset designed for spatial reasoning in Vision-Language Models. Datasets play a crucial role in training and evaluating AI models, and a high-quality dataset specifically focused on spatial reasoning can significantly advance research in this area. This dataset likely contains a diverse set of scenarios and tasks that require spatial understanding, such as object localization, spatial relationships, and navigation. The availability of such a dataset will facilitate the development of more sophisticated AI systems that can effectively reason about spatial information.

Telerobotics with Cloud-Fog Automation: A Practical Architecture

CFTel: A Practical Architecture for Robust and Scalable Telerobotics with Cloud-Fog Automation presents a practical architecture for telerobotics that leverages cloud-fog automation. Telerobotics, the remote operation of robots, has numerous applications in industries such as healthcare, manufacturing, and exploration. This paper, accepted for presentation at the 23rd IEEE International Conference on Industrial Informatics (INDIN) in July 2025, likely details a system that combines the benefits of cloud computing (e.g., scalability, centralized processing) with fog computing (e.g., low latency, edge processing) to create a robust and scalable telerobotics platform. This architecture can enable more reliable and efficient remote robot operation in various applications.

Evaluating World Models with Vision-Language Models

Adapting Vision-Language Models for Evaluating World Models explores the adaptation of Vision-Language Models for evaluating world models. World models, as previously discussed, are crucial for Embodied AI agents. Evaluating the quality and accuracy of these models is essential for ensuring the reliability of AI systems. This research likely presents a methodology for using Vision-Language Models to assess how well a world model captures the dynamics and semantics of an environment. This approach can provide valuable insights into the strengths and weaknesses of different world modeling techniques.

Multi-agent Embodied AI: Advances and Future Directions

Multi-agent Embodied AI: Advances and Future Directions discusses the advances and future directions in multi-agent Embodied AI. Multi-agent systems, where multiple AI agents interact and collaborate, are increasingly important for solving complex problems. This paper likely provides an overview of the current state-of-the-art in multi-agent Embodied AI, highlighting key challenges and opportunities for future research. This includes topics such as coordination, communication, and cooperation among agents in dynamic environments.

Interactive Safety of VLM-Driven Embodied Agents

IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks introduces IS-Bench, a benchmark for evaluating the interactive safety of Vision-Language Model (VLM)-driven Embodied AI agents in daily household tasks. Safety is a paramount concern in Embodied AI, especially when agents operate in human environments. This benchmark likely provides a standardized set of tasks and metrics for assessing the safety performance of AI agents, ensuring that they can interact safely with humans and their surroundings. This is crucial for the widespread adoption of Embodied AI in real-world applications.

Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning

DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning presents DualTHOR, a dual-arm humanoid simulation platform designed for contingency-aware planning. Simulation environments are invaluable for training and testing AI agents before deployment in the real world. This platform likely provides a realistic and flexible environment for developing AI systems that can handle unexpected events and contingencies. Dual-arm manipulation is a challenging task in robotics, and this platform facilitates research in this area.

Reconstruction-Free Open-Vocabulary 3D Object Detection

BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion introduces BoxFusion, a method for reconstruction-free open-vocabulary 3D object detection using real-time multi-view box fusion. Object detection, the ability to identify and localize objects in a scene, is a fundamental capability for Embodied AI agents. Open-vocabulary object detection allows agents to recognize objects they have not been explicitly trained on. This research likely presents a novel approach to 3D object detection that is both efficient and versatile, enhancing the perception capabilities of AI agents.

Environmental Understanding for Visual Navigation

Efficient and Generalizable Environmental Understanding for Visual Navigation focuses on efficient and generalizable environmental understanding for visual navigation. Visual navigation, the ability to navigate through an environment using visual input, is a core competency for Embodied AI agents. This research likely presents techniques for building AI systems that can efficiently and robustly understand their environment, enabling them to navigate effectively in a variety of settings. This is essential for applications such as autonomous robots and self-driving vehicles.

Reinforcement Learning: New Research and Developments

The field of Reinforcement Learning (RL) continues to evolve, with new research papers exploring diverse topics such as demand charge scheduling, preference alignment, and bridging offline and online learning. This section provides an in-depth look at these recent advancements, highlighting the key findings and their potential impact on the future of RL applications. Understanding these developments is crucial for researchers and practitioners seeking to leverage Reinforcement Learning for solving complex decision-making problems.

Joint Scheduling of DER Under Demand Charges

The paper Joint Scheduling of DER under Demand Charges: Structure and Approximation delves into the complex problem of jointly scheduling Distributed Energy Resources (DER) under demand charges. DERs, such as solar panels and battery storage systems, are becoming increasingly prevalent in modern energy grids. Efficiently scheduling these resources under demand charge constraints is a challenging optimization problem. This research, spanning 15 pages with 4 tables and 4 figures, likely presents a structural analysis of the problem and proposes approximation algorithms for finding near-optimal solutions. This work is crucial for improving the efficiency and reliability of energy grids, especially as DER adoption continues to grow. Understanding the intricacies of DER scheduling can lead to significant cost savings and environmental benefits.
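To see why demand charges make scheduling hard, note that the bill depends on the *peak* demand over the billing period, not just total energy consumed. A toy illustration (all prices and load profiles below are made up):

```python
energy_price = 0.12   # $/kWh
demand_price = 15.0   # $/kW, applied to the billing-period peak

def bill(load_kw, hours_per_step=1.0):
    """Energy charge plus a demand charge on the peak load."""
    energy_cost = energy_price * sum(p * hours_per_step for p in load_kw)
    demand_cost = demand_price * max(load_kw)
    return energy_cost + demand_cost

flat = [5.0, 5.0, 5.0, 5.0]      # e.g. a battery flattens the profile
peaky = [2.0, 2.0, 2.0, 14.0]    # same 20 kWh of energy, but a 14 kW peak

print(bill(flat), bill(peaky))   # the peaky profile is far more expensive
```

Both profiles consume the same energy, yet the peaky one costs much more, so an optimal DER schedule must trade off when energy is used against how high the peak becomes, which couples all time steps together.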

Multi-Preference Lambda-weighted Listwise DPO for Dynamic Preference Alignment

Multi-Preference Lambda-weighted Listwise DPO for Dynamic Preference Alignment introduces a novel approach to dynamic preference alignment using Multi-Preference Lambda-weighted Listwise Direct Preference Optimization (DPO). Preference alignment, the process of aligning an AI agent's behavior with human preferences, is a critical challenge in Reinforcement Learning. This paper, slated to appear in the Proceedings of AAAI 2026, presents a method for dynamically adjusting the agent's behavior based on multiple preferences. The code for this research is available at https://github.com/yuhui15/Multi-Preference-Lambda-weighted-DPO, providing a valuable resource for those interested in implementing this technique. This approach has significant implications for applications where AI systems need to adapt to varying user needs and preferences.
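For orientation, the standard *pairwise* DPO loss that listwise variants build on can be sketched in a few lines; the paper's lambda-weighted listwise formulation generalizes beyond what is shown here, and the log-probabilities below are illustrative inputs rather than outputs of a real model.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Pairwise DPO: -log sigmoid(beta * implicit reward margin).

    logp_w / logp_l: policy log-probs of the preferred / dispreferred response.
    ref_logp_*: the same quantities under a frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy favors the preferred response more strongly than the reference does, the margin is positive and the loss is small; reversing the preference increases the loss, which is the gradient signal that aligns the policy.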

Bridging Offline and Online Reinforcement Learning for LLMs

Bridging Offline and Online Reinforcement Learning for LLMs explores the intersection of offline and online Reinforcement Learning for Large Language Models (LLMs). LLMs have shown remarkable capabilities in natural language processing, but their application in RL is still an active area of research. Bridging the gap between offline RL, where agents learn from pre-collected data, and online RL, where agents interact with the environment, is crucial for developing robust and adaptable AI systems. This research likely investigates methods for leveraging offline data to pre-train LLMs, which can then be fine-tuned online through interaction with the environment. This approach can significantly improve the efficiency and effectiveness of training LLMs for RL tasks.

In-Context Reinforcement Learning in Transformers: From Memories to Maps

From Memories to Maps: Mechanisms of In-Context Reinforcement Learning in Transformers delves into the mechanisms of in-context Reinforcement Learning in Transformer models. Transformer models, known for their ability to process sequential data, have shown promise in RL tasks. In-context learning, where models learn from the context provided in the input, is a powerful capability that can enable rapid adaptation to new situations. This research likely investigates how Transformer models utilize memories and construct internal maps of the environment to facilitate in-context RL. Understanding these mechanisms can lead to the development of more efficient and flexible RL agents. The updated version of the paper includes additional funding sources and formatting corrections.

Graphs Meet AI Agents: Taxonomy, Progress, and Future Opportunities

Graphs Meet AI Agents: Taxonomy, Progress, and Future Opportunities provides a comprehensive overview of the intersection between graph-based methods and AI agents. Graphs are a powerful tool for representing relationships and structures, making them well-suited for RL tasks that involve complex environments. This paper, spanning 20 pages with 7 figures, likely presents a taxonomy of graph-based RL methods, reviews the progress in this area, and identifies future research opportunities. This survey is a valuable resource for researchers looking to leverage graph structures for enhancing the capabilities of RL agents.

Optimising 4th-Order Runge-Kutta Methods

Optimising 4th-Order Runge-Kutta Methods: A Dynamic Heuristic Approach for Efficiency and Low Storage explores the optimization of 4th-order Runge-Kutta methods, a classic family of numerical solvers for ordinary differential equations. Efficient ODE solvers matter for Reinforcement Learning because many RL environments simulate continuous dynamic systems. This research likely presents a dynamic heuristic approach for optimizing Runge-Kutta methods, aiming to improve computational efficiency and reduce storage requirements, which can in turn lead to faster and more scalable RL simulation.
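For reference, the textbook classical 4th-order Runge-Kutta step looks like this; the paper optimizes efficiency and storage of such schemes, which this plain implementation does not attempt.

```python
def rk4_step(f, t, y, h):
    """One classical RK4 step for y' = f(t, y) with step size h."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# Integrate y' = y from y(0) = 1 to t = 1; the exact answer is e ~ 2.71828.
y, t, h = 1.0, 0.0, 0.1
for _ in range(10):
    y = rk4_step(lambda t, y: y, t, y, h)
    t += h
```

Even with a coarse step of 0.1, the 4th-order accuracy keeps the error tiny; the storage cost the paper targets comes from holding the four intermediate slopes k1..k4.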

Reward Modeling as Discriminative Prediction

Fake it till You Make it: Reward Modeling as Discriminative Prediction investigates reward modeling as a discriminative prediction task. Reward modeling, the process of learning a reward function from data, is a key component of many RL algorithms. This research likely presents a novel perspective on reward modeling, framing it as a discriminative prediction problem. This approach can potentially simplify the reward modeling process and improve the performance of RL agents. This has implications for applications where reward functions are difficult to define or obtain directly.
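At a high level, a discriminative reward model is a classifier whose score separates preferred from dispreferred outputs. A minimal sketch of that framing (the linear scorer and hand-picked features and weights below are stand-ins for a learned network, not the paper's model):

```python
import math

def reward_score(features, weights, bias=0.0):
    """Reward as a discriminative prediction: P(output is preferred)."""
    logit = sum(x * w for x, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

weights = [1.0, -0.5]                        # hypothetical learned weights
good = reward_score([2.0, 0.5], weights)     # features of a "good" output
bad = reward_score([0.1, 2.0], weights)      # features of a "bad" output
```

Once trained, the classifier's probability (or logit) can be used directly as a scalar reward in downstream RL, sidestepping the need to hand-design a reward function.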

Spatial Mental Modeling from Limited Views

Spatial Mental Modeling from Limited Views focuses on spatial mental modeling from limited viewpoints, a critical capability for agents operating in partially observable environments. RL agents often need to build internal representations of their environment based on limited sensory input. This research likely presents a method for constructing spatial mental models from partial observations, enabling agents to make informed decisions even when their view of the world is incomplete. This is particularly relevant for robotics and autonomous navigation tasks.

Flow-Based Single-Step Completion for Policy Learning

Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning introduces a flow-based approach to single-step completion for efficient and expressive policy learning. Policy learning, the process of learning an optimal policy for decision-making, is a core aspect of Reinforcement Learning. This research likely presents a novel method for policy learning that leverages flow-based models to efficiently complete partial trajectories. This approach can potentially improve the sample efficiency and expressiveness of RL algorithms.

Continual Learning as Computationally Constrained Reinforcement Learning

Continual Learning as Computationally Constrained Reinforcement Learning explores continual learning within the framework of computationally constrained Reinforcement Learning. Continual learning, as discussed earlier, is a crucial capability for AI agents that need to adapt to changing environments. This research likely formulates continual learning as an RL problem with computational constraints, addressing the challenges of efficiently learning new tasks without forgetting previously learned ones. This is particularly relevant for resource-constrained devices and applications.

Masked Diffusion Models for Code Generation

DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation investigates the use of masked diffusion models for code generation. Code generation, the task of automatically generating computer code, is an emerging application of AI. This research likely presents an analysis of masked diffusion models in the context of code generation, identifying areas for improvement and proposing techniques to enhance their performance. This has implications for software development and automated programming.

Regularizing Q-Value Distributions with Image Augmentation

rQdia: Regularizing Q-Value Distributions With Image Augmentation introduces rQdia, a method for regularizing Q-value distributions using image augmentation. Q-learning, a popular RL algorithm, relies on estimating Q-values, which represent the expected cumulative reward for taking a particular action in a given state. This research likely presents a technique for regularizing the distribution of Q-values using image augmentation, which can improve the stability and performance of Q-learning algorithms.
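The general idea behind this kind of regularizer (sketched here from first principles, not from the paper) is to penalize the gap between the Q-values of an observation and of an augmented copy, pushing the Q-function to be invariant to nuisance image transformations. The Q-network and augmentation below are toy stand-ins.

```python
import numpy as np

def q_consistency_loss(q_net, obs, augment):
    """MSE between Q-value vectors of an observation and its augmentation."""
    q_orig = q_net(obs)
    q_aug = q_net(augment(obs))
    return float(np.mean((q_orig - q_aug) ** 2))

# Toy check: a Q-function that only depends on the pixel mean is invariant
# to shuffling pixels, so the regularization loss is exactly zero.
q_net = lambda obs: np.array([obs.mean(), -obs.mean()])
augment = lambda obs: np.random.default_rng(0).permutation(obs)
loss = q_consistency_loss(q_net, np.arange(16, dtype=float), augment)
```

In practice such a term would be added to the usual temporal-difference loss, so the agent learns Q-values that are stable under visual perturbations.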

Value of Information for Communication and Control in 6G V2X

Learning Value of Information towards Joint Communication and Control in 6G V2X explores the value of information for joint communication and control in 6G Vehicle-to-Everything (V2X) systems. V2X communication, which enables vehicles to communicate with each other and with infrastructure, is crucial for autonomous driving and intelligent transportation systems. This research likely investigates how RL can be used to learn the value of information in V2X systems, optimizing communication and control strategies. This has significant implications for the development of safer and more efficient transportation systems.

Regret Bounds for Robust Online Decision Making

Regret Bounds for Robust Online Decision Making focuses on regret bounds for robust online decision-making. Online decision-making, where decisions are made sequentially over time, is a central aspect of Reinforcement Learning. This research likely presents theoretical results on regret bounds, which provide guarantees on the performance of online decision-making algorithms in the face of uncertainty and adversarial environments. This contributes to the theoretical foundations of RL and robust decision-making.
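For readers new to the area, regret after T rounds compares the learner's cumulative reward to that of the best fixed policy in hindsight; a standard (external) regret definition, which robust variants then bound under worst-case or adversarial environments, is:

```latex
R_T \;=\; \max_{\pi \in \Pi} \sum_{t=1}^{T} r_t(\pi) \;-\; \sum_{t=1}^{T} r_t(\pi_t)
```

where \(\Pi\) is the policy class, \(r_t\) the reward at round \(t\), and \(\pi_t\) the policy the learner actually played. A sublinear bound on \(R_T\) means the learner's average performance converges to that of the best fixed policy.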

LLM Learns When to Think

Thinkless: LLM Learns When to Think introduces a novel concept where Large Language Models (LLMs) learn when to engage in more computationally intensive reasoning processes. LLMs, while powerful, can be computationally expensive to run. This research likely presents a method for enabling LLMs to selectively engage in deeper reasoning only when necessary, thereby improving efficiency. This approach has significant implications for deploying LLMs in resource-constrained environments and for tasks that require real-time responsiveness.

Robotics: Advancements in Grasping, Navigation, and Human-Robot Interaction

The field of Robotics continues to advance rapidly, with new research papers focusing on areas such as robotic grasping, navigation in crowded environments, and human-robot interaction. This section covers these latest developments in perception, manipulation, and interaction, highlighting research poised to transform industries and everyday life.

Robotic Grasping with Consensus-Driven Uncertainty

One significant paper, Consensus-Driven Uncertainty for Robotic Grasping based on RGB Perception, accepted to IROS 2025, presents a novel approach to robotic grasping using consensus-driven uncertainty based on RGB perception. Robotic grasping, the ability of robots to grasp and manipulate objects, is a fundamental capability in many applications. Uncertainty estimation is crucial for robust grasping, as it allows robots to adapt to variations in object shape, size, and pose. This research likely introduces a method for quantifying uncertainty in grasping using RGB images, leveraging consensus among multiple grasping hypotheses. This approach can improve the reliability and success rate of robotic grasping systems.
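A generic consensus-style uncertainty estimate (sketched here as background, not as the paper's method) runs several grasp predictors, or several stochastic passes of one predictor, and treats their disagreement as the uncertainty of the proposed grasp. The pose representation and example numbers below are hypothetical.

```python
import numpy as np

def consensus_uncertainty(grasp_poses):
    """Average the hypotheses into a consensus grasp; spread = uncertainty."""
    poses = np.asarray(grasp_poses)      # shape (n_hypotheses, pose_dim)
    consensus = poses.mean(axis=0)       # agreed-upon grasp pose
    spread = float(poses.std(axis=0).mean())  # disagreement across hypotheses
    return consensus, spread

# Agreeing hypotheses -> low uncertainty; conflicting ones -> high uncertainty.
_, s_tight = consensus_uncertainty([[0, 0, 0], [0.01, 0, 0], [0, 0.01, 0]])
_, s_loose = consensus_uncertainty([[0, 0, 0], [1.0, 0, 0], [0, 1.0, 0]])
```

A grasping system can then execute only grasps whose spread falls below a threshold, retrying or re-perceiving otherwise, which is one way uncertainty improves reliability.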

User Interpretation and Decision-Making in Healthcare Robotics

The paper **[