HUMANOID
Introducing KinetIQ — Our Proprietary AI Framework for Humanoid Robot Fleet Orchestration
Introduction
At Humanoid, we are building a range of humanoid robotic products aimed at addressing physical labour demand across multiple sectors, from industrial environments to service and home applications. We are building our AI stack, KinetIQ, in a way that unifies different products and leverages shared capabilities between them as much as possible. In a recently released video, we demonstrated how KinetIQ can orchestrate execution of multiple interrelated tasks in industrial and service settings, operating across two different robot embodiments and several timescales: from high-frequency whole-body control to multi-robot fleet orchestration. In this blog post, we will share some technical details on different components of KinetIQ, the design decisions behind them, and some of the present and future challenges involved in building such systems.
KinetIQ architecture overview
The KinetIQ architecture consists of four distinct layers covering different timescales and orchestration scopes. Following and extending the terminology introduced by Stanovich and West, and later popularized by Kahneman, we assign a number to each system indicating its level in the overall cognitive hierarchy:
System 3 (seconds to minutes and beyond): coordinates multiple robots to achieve externally defined fleet-level goals, treating robots as tools within an agentic framework.

System 2 (seconds to minutes): coordinates the actions of a single robot in pursuit of tasks assigned by System 3, treating the robot's capabilities as tools and operating over discrete symbolic state.

System 1 (10 Hz): translates natural-language goals from System 2 into target poses for System 0, planning the motions needed to progress towards each goal; based on Vision-Language-Action models.

System 0 (50 Hz): executes the target poses produced by System 1 while keeping motion dynamically stable, physically feasible, and robust to disturbances, interfacing directly with robot hardware.
There is a clear narrative across these layers: each system receives goals from its supervising system and aims to achieve them while treating the layers beneath it as a set of tools within an agentic framework. This decomposition allows us to cleanly reason about individual systems, tweak and test their behavior while abstracting away layers below and above.
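As an illustration of this contract, the sketch below captures one way the layer boundaries could be expressed as interfaces. It is a simplified, hypothetical rendering; the class and method names are ours for illustration, not KinetIQ's actual APIs.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class Pose:
    """A 6-DoF target pose for an end effector or the pelvis."""
    position: tuple[float, float, float]
    orientation: tuple[float, float, float, float]  # quaternion (x, y, z, w)


class System0(Protocol):
    """Whole-body control: turns target poses into stable joint commands (50 Hz)."""
    def track(self, targets: dict[str, Pose]) -> None: ...


class System1(Protocol):
    """Goal handling: turns a natural-language goal into target poses (10 Hz)."""
    def step(self, goal: str, observation: dict) -> dict[str, Pose]: ...


class System2(Protocol):
    """Robot-level executive: sequences tools to accomplish an assigned task."""
    def execute(self, task: str) -> bool: ...


class System3(Protocol):
    """Fleet orchestration: assigns tasks to robots to advance fleet-level goals."""
    def assign(self, jobs: Sequence[str], robots: Sequence[System2]) -> None: ...
```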
We fully expect that, over time, the boundaries between different layers will become progressively fuzzier. However, given the current state of AI technology, we believe this decomposition strikes the right balance between achievable capability and system transparency. This, in turn, facilitates development and reduces time‑to‑market, an important consideration for a commercially oriented company like ours.
In the subsequent sections of this blog post, we will explore the cognitive hierarchy layers of KinetIQ in more detail, discussing some technical aspects of their current implementations as well as some challenges practitioners may encounter when working on similar systems.
System 0
System 0 is the whole-body control layer of KinetIQ. Its role is to execute target poses produced by System 1 while ensuring that motion remains dynamically stable, physically feasible, and robust to external disturbances. It operates at a 50 Hz control rate and forms the lowest layer in the cognitive hierarchy, interfacing directly with robot hardware.
Unlike the layers above it, System 0 does not reason about tasks, intent, or symbolic state. Its responsibility is purely physical: given a stream of pose-level objectives, it continuously produces joint-level commands that respect robot dynamics, kinematic constraints, contact conditions, and balance requirements.
System 0 operates on top of a compliant low-level control layer running at 1 kHz, which absorbs impacts and handles high-frequency disturbances during physical interaction.
Proprioceptive feedback from actuators and onboard sensors closes the control loop, allowing joint commands to be continuously updated in real time while faithfully tracking the motion intent produced by System 1.
Bipedal locomotion policies are trained with this control layer in the loop, allowing them to learn the true closed-loop dynamics of the system.
SYSTEM 0
Hybrid whole-body control
System 0 is implemented as a hybrid control system composed of two tightly coupled components: a shared upper-body inverse kinematics (IK) controller and a lower-body reinforcement-learning (RL) locomotion policy.
The upper-body controller is identical across platforms. Both wheeled and bipedal robots use the same IK-based formulation for upper-body frames such as arms, torso, and head. By keeping upper-body control invariant to embodiment, we ensure that VLA-based manipulation policies transfer cleanly across robots, independent of how locomotion is implemented underneath.
An additional advantage of using a classical IK-based whole-body controller for the upper body is configurability at execution time. The IK formulation provides explicit control over redundancy resolution, allowing us to decide how different degrees of freedom—such as arms and torso—are used for a given task. This configurability is exposed through well-defined controller variants and parameters, enabling the execution strategy to be adjusted.
This makes it possible to bias execution toward different strategies and task-appropriate postures, for example prioritizing arm-only motion for precision tasks or leveraging the torso when reach or workspace coverage is required.
Beyond redundancy management, the IK layer allows us to enforce hard constraints that are difficult to guarantee reliably at the policy level, and to configure them dynamically based on task requirements.
This is particularly important for tasks that require strict geometric guarantees for success. For example, when handling a tote, it can be useful to couple both arms, enforcing hard relative-pose constraints between end effectors so that the object pose is maintained throughout execution. System 0 exposes the ability for System 1 to enable and disable such regimes as requested.
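As a hypothetical illustration of such a regime switch, the sketch below shows how an execution-time IK configuration might expose torso usage and end-effector coupling as parameters that System 1 can toggle. All names are illustrative, not the actual controller interface.

```python
from dataclasses import dataclass


@dataclass
class UpperBodyIKConfig:
    """Execution-time configuration of the upper-body IK controller (illustrative)."""
    use_torso: bool = True                                  # allow torso DoFs in redundancy resolution
    coupled_end_effectors: tuple[str, str] | None = None    # hard relative-pose constraint


class UpperBodyIKController:
    """Toy stand-in for a configurable IK controller."""

    def __init__(self) -> None:
        self.config = UpperBodyIKConfig()

    def set_config(self, config: UpperBodyIKConfig) -> None:
        self.config = config


controller = UpperBodyIKController()

# Precision task: arms only, torso held still.
controller.set_config(UpperBodyIKConfig(use_torso=False))

# Tote handling: couple both arms so the relative pose between grippers is preserved.
controller.set_config(
    UpperBodyIKConfig(use_torso=True, coupled_end_effectors=("left_gripper", "right_gripper"))
)
```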
By handling these constraints at the control layer, System 0 ensures safe and consistent execution of lower-level behaviors, while allowing the layers above to remain focused on high-level intent rather than strict feasibility enforcement.

Lower-body control is handled by an RL-trained policy responsible for balance, gait generation, and dynamic recovery. This policy is specific to bipedal locomotion but composes cleanly with the shared upper-body controller by treating upper-body motion as an external disturbance and using the pelvis as its primary motion reference. Pelvis pose targets are exposed as the interface between System 1 and the locomotion policy: by specifying desired pelvis position and orientation, the VLA can influence bipedal behaviors such as walking, turning, or squatting, while the lower-body controller autonomously resolves balance and recovery.
This separation allows whole-body behaviors to emerge without entangling manipulation logic with embodiment-specific locomotion control.
This hybrid structure enables repeatable and reusable whole-body control across platforms, minimizing embodiment-specific tuning while pairing robust learned locomotion with precise, transferable IK-based manipulation.
For bipedal robots, this separation introduces an intentional limitation on the degree of whole-body coordination the system can express, as also concluded in [1]. In highly dynamic scenarios, humans naturally recruit their arms, torso, and lower body jointly to maintain balance and recover from disturbances. In our current setup, upper-body manipulation is not directly used as a balance mechanism, and arm motion remains decoupled from the locomotion policy.
We are fully aware of recent approaches that pursue tightly coupled whole-body locomanipulation through unified policies [2]. However, for the class of applications we are targeting today – where predictability, robustness, and cross-embodiment reuse are primary requirements – we deliberately favor this structured decomposition for bipedal control.
This choice allows us to scale capability development and deploy reliable behavior in real-world settings, while keeping the door open to tighter whole-body integration as learning-based methods and product requirements evolve.
SYSTEM 0
Training and robustness
The bipedal locomotion policy is trained purely in simulation using online reinforcement learning. Training is performed across thousands of parallel scenarios with deliberately limited domain randomization, relying instead on accurate system identification [3] to align simulation with real-world dynamics.
This allows us to learn robust walking and reliable recovery behaviors under well-defined physical conditions, resulting in policies that are less conservative while remaining robust, stable and easy to compose with higher-level planning and manipulation policies.
SYSTEM 0
Abstraction boundary
System 0 deliberately hides the complexity of whole-body dynamics from the layers above it. From the perspective of System 1 and System 2, it behaves as a reliable actuator: given target poses, it produces stable motion.
This abstraction boundary allows higher layers to reason in task-centric and symbolic spaces without being exposed to the full complexity of humanoid dynamics, while still enabling physically capable behavior in the real world.
One additional reason to use target poses as the input interface to System 0 is that this abstraction aligns naturally with scalable data sources expressed in end-effector space, such as UMI-like devices [4] or egocentric video: if System 1 outputs describe end-effector motion, the use of these modalities in System 1 training is relatively straightforward.
At Humanoid, we see UMI-style approaches as a promising way to scale data collection while keeping the embodiment gap low.
Our hardware design efforts explicitly account for the requirement of having UMI devices that reflect the kinematics and constraints of the current robot hardware as closely as possible.
System 1
System 1 is the low-level goal-handling layer of KinetIQ. It receives natural language goal descriptions from System 2 (e.g. “lift the tote”, “pick the oil bottle from the shelf”), plans the motions needed to progress towards completing the goal given the observed state, expresses those motions as target poses, and instructs System 0 to achieve these poses. KinetIQ System 1 is based on Vision-Language-Action (VLA) model technology.
VLA models have been a breakthrough in applying deep learning to manipulation and contextual navigation because they managed to unify what used to be separate, brittle components (task-level perception, decision-making and motion planning) into a learned policy grounded in vision and language, while still benefiting from classical state estimation and low-/mid-level control.
At Humanoid, we had the luxury of starting without carrying over a legacy locomanipulation stack, so we built around deep-learning-first methods from day one and immediately started using VLAs as our primary tool for implementing lower-level capabilities.
Our current VLA models are built for real-world deployment on edge hardware. Today, we use a Vision-Language Model (VLM) as VLA backbone, processing multi-camera inputs alongside language goals and robot state, and then routing this information through specialized action decoders to produce actions.
Pre-training VLMs on internet-scale visual-language data gives them a strong prior for recognizing objects, understanding scenes, and following natural-language intent, capabilities that carry over to locomanipulation. We predict actions via an action head attached to the VLM backbone. We normally predict actions in chunks, corresponding to a 1-2 second horizon, and execute each action in the chunk at a high rate (30-50 Hz). Chunking allows us to disentangle policy inference performance and control signal frequency.
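To make the chunking scheme concrete, here is a minimal sketch of the execution loop, assuming placeholder `policy` and `robot` interfaces and illustrative rates: one comparatively slow inference call yields a chunk of actions that is then streamed to System 0 at the control rate.

```python
import time

CONTROL_RATE_HZ = 50          # rate at which individual actions are sent to System 0
CHUNK_HORIZON_S = 1.5         # each predicted chunk covers roughly a 1-2 s horizon
ACTIONS_PER_CHUNK = int(CONTROL_RATE_HZ * CHUNK_HORIZON_S)


def run_policy(policy, robot, goal: str) -> None:
    """Query the VLA for action chunks and execute them at the control rate."""
    while not robot.task_done():
        observation = robot.observe()
        # One (comparatively slow) inference call yields a whole chunk of actions.
        chunk = policy.predict_chunk(observation, goal, length=ACTIONS_PER_CHUNK)
        for action in chunk:
            robot.apply(action)                # hand the action to System 0
            time.sleep(1.0 / CONTROL_RATE_HZ)  # pace execution at the control rate
```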
Our current VLA stack deliberately supports multiple action head architectures on top of the same VLM backbone. Conditional flow matching (CFM) heads work well for robust fine-tuning to unseen tasks, while delivering strong inference performance.
As an example, inference in our 3B CFM VLA takes about 34 ms on an H100 GPU.
However, CFM-based models come with certain limitations. For example, reinforcement learning (RL) for CFMs is still emerging, although we see early signs of life with approaches such as [7]. Autoregressive heads, by contrast, plug directly into the RL toolchain that LLM research has spent years refining. This is why a lot of our current work on RL for locomanipulation focuses on autoregressive models, creating the demand for a VLA toolset that is agnostic to the exact form of the machinery that emits actions.
All that said, VLM backbones are not the endgame for embodied intelligence. Despite their strengths, they still perceive dynamics and physics poorly because their training signal is largely static, observational and mostly high-level. We expect video generation backbones to close this gap and better capture physical world dynamics, as evidenced by some early works such as [5] or [6]. The KinetIQ architecture is designed to seamlessly integrate these new types of models as they advance, without re-architecting upstream and downstream layers.
SYSTEM 1
Asynchronous inference
We run VLA inference both on-board and in the cloud. The ability to run inference on the cloud is important because it lets us iterate quickly on new models without immediately optimizing for target edge-hardware, and gives us the ability to experiment with larger models.
The tradeoff is network latency and jitter: network delays can make synchronous cloud inference produce stepwise motion with visible pauses between action chunks. This is undesirable from an aesthetic standpoint. More importantly, delays also affect policy execution dynamics, pushing the policy out of distribution, especially if it perceives its own dynamics through some form of memory.
To address this, we use asynchronous inference: the model predicts the next action chunk while the previous chunk is still executing. The core challenge is that the prediction is made before the ongoing chunk has finished, so the model can condition on a slightly “stale” view of the world and produce actions that are inconsistent with what will actually happen during inference, leading to jerky motions when switching between chunks.
In other words, we need a way to tell the model what will happen by the time inference finishes. One well-known solution is real-time action chunking via inpainting, borrowed from diffusion literature [8]. We avoid it because it adds inference-time overhead and is only applicable to CFM models, while, as mentioned above, we want flexible support for multiple action heads.
Instead, we use a model-agnostic method we call prefix conditioning. We pass the part of the action chunk that will execute during inference as an additional input to the model. During training, when this input is present, the model learns to copy it into the output (since it matches the ground truth) and to produce a smooth continuation for the remainder of the chunk.
By sampling different prefix lengths during training, we can tolerate a range of inference latencies and deployment conditions. Because this mechanism is purely based on conditioning, it works with any action head architecture. The closest published approach we are aware of is [9].
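A minimal sketch of prefix conditioning, under assumed data layout and model interfaces: during training a random prefix of the ground-truth chunk is provided as an extra input, and at inference time the prefix is the part of the currently executing chunk that will play out while the new prediction is computed. The function and tensor shapes below are illustrative, not our training code.

```python
import numpy as np


def make_training_example(obs, gt_chunk: np.ndarray, max_prefix: int, rng: np.random.Generator):
    """Build a prefix-conditioned training example.

    gt_chunk: ground-truth action chunk of shape (T, action_dim).
    A random prefix length simulates the range of inference latencies seen at deployment.
    """
    prefix_len = int(rng.integers(0, max_prefix + 1))
    prefix = gt_chunk[:prefix_len]          # actions that will execute while inference runs
    # The model is trained to reproduce the prefix (it matches the ground truth)
    # and to produce a smooth continuation for the remainder of the chunk.
    return {"obs": obs, "prefix": prefix, "target": gt_chunk}


def async_inference_step(policy, obs, executing_chunk: np.ndarray, steps_remaining: int):
    """Predict the next chunk while the current one is still executing."""
    # The tail of the current chunk is exactly what the robot will do before inference finishes.
    prefix = executing_chunk[-steps_remaining:] if steps_remaining > 0 else executing_chunk[:0]
    return policy.predict_chunk(obs, prefix=prefix)
```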
SYSTEM 1
Value prediction
System 1 should be able to understand its progress, not just produce actions. To this end, we train an additional value head that predicts time-to-completion — the number of remaining steps until the task is successfully finished given the current state.
As this estimate drops toward zero, execution is on track; if it plateaus, something’s wrong. This signal flows back to System 2, enabling it to retry, adjust goals, or escalate when needed.
Time-to-completion can also be predicted at the level of individual actions, providing an estimate of how much a certain action advances execution towards the goal. This opens up another interesting opportunity: one can sample multiple action chunks during inference and then commit to the one that will lead to success faster. This simple technique allows us to trade off inference compute for improved VLA policy performance.
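In code, this best-of-N selection is only a few lines; the policy and value-head interfaces below are placeholders for illustration.

```python
def select_best_chunk(policy, value_head, observation, goal: str, num_samples: int = 8):
    """Trade inference compute for performance: pick the chunk predicted to finish fastest."""
    candidates = [policy.predict_chunk(observation, goal) for _ in range(num_samples)]
    # Lower predicted time-to-completion means the chunk advances the task further.
    scores = [value_head.time_to_completion(observation, chunk) for chunk in candidates]
    best_index = min(range(num_samples), key=lambda i: scores[i])
    return candidates[best_index]
```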
SYSTEM 1
Capability factory
Zero-shot generalization to arbitrary novel behaviors remains out of reach for VLA technology for now. To ship a robust product, especially for the industrial setting, we instead focus on building a capability factory: a toolset that lets us rapidly bootstrap the behaviors of interest and then polish them until they meet deployment robustness and KPI requirements.
Our goal is to bring this tooling to a level where new robust behaviors can be curated in hours from a small number of demonstrations by robot trainers who don’t have an engineering degree, thus enabling rapid scaling of available low-level capabilities. Our current toolset covers data collection, dataset curation, labeling, training, and evaluation. We record both teleoperation and autonomous policy execution through a unified stack, which makes workflows like DAgger and training on successful policy rollouts very natural. Our tooling also provides a number of readily available recipes to make capability post-training much easier.
One feature of our capability factory that we find quite useful is behavior gluing: the ability to compose a single low-level behavior (i.e. the behavior associated with a specific prompt) from multiple sub-behaviors. This is useful when a behavior has both hard and easy parts. For example, in machine feeding scenarios, picking can be difficult and data-hungry, while moving a grasped item is much easier and can be learned from very little data.
Behavior gluing lets us concentrate data collection on the hard part (picking) instead of collecting many end-to-end demonstrations, which in such cases can be wasteful, or delegating orchestration of sub-behaviors to System 2, which can unnecessarily increase complexity.
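One way to picture behavior gluing is as a relabeling step over demonstration segments: sub-behavior episodes are stitched under a single parent prompt so that the trained policy exposes one behavior upward. The sketch below is a simplified illustration with invented field names, not our actual data format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    prompt: str        # sub-behavior prompt, e.g. "pick item from bin"
    episode_id: str
    frames: list       # demonstration frames for this segment


def glue_behavior(parent_prompt: str, segments: list[Segment]) -> list[dict]:
    """Relabel sub-behavior segments so they train a single prompt-level behavior."""
    glued = []
    for seg in segments:
        glued.append({
            "prompt": parent_prompt,     # the behavior System 2 will request
            "sub_prompt": seg.prompt,    # kept for curation and analysis
            "episode_id": seg.episode_id,
            "frames": seg.frames,
        })
    return glued


# Concentrate data collection on the hard part (picking); the easy transfer motion
# needs only a handful of demonstrations.
pick_segments = [Segment("pick item from bin", "ep_001", frames=[])]
transfer_segments = [Segment("move item to machine", "ep_002", frames=[])]
dataset = glue_behavior("feed the machine", pick_segments + transfer_segments)
```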
One problem the capability factory needs to address is how to curate behaviors that execute faster than the speed humans are able to maintain during teleoperation. Modern-day practitioners often solve this problem with a technique called sport mode: executing a policy at a faster rate than the rate used to collect the imitation data, providing a straightforward way to improve policy throughput. However, it only works reliably under very restrictive assumptions, such as quasi-static scenarios, which cannot be assumed in all cases.
One typical failure of sport mode on our current wheeled system stems from the fact that grippers cannot close at arbitrary speeds: we already operate at maximum gripper speed even when executing policies at 1x. Therefore, if we speed up execution further, the end effectors move faster, but the gripper does not close any faster, leading to situations where the robot arm starts moving away from an object before completing the grasp. Similar issues can potentially happen with any actuator.
Our capability factory currently addresses this issue via a technique we call sport-mode data preprocessing: applying non-linear execution speedup to the collected data during the preprocessing stage, keeping the speed of potentially fragile parts of behaviors (e.g. grasping) at 1x.
It is available in the form of a recipe that a behavior curator can apply to demonstrations prior to training.
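A minimal sketch of what such a preprocessing recipe could look like, assuming demonstrations stored as dense action arrays with a per-frame mask marking fragile phases: the function below resamples non-fragile stretches at the requested speedup while leaving fragile stretches (e.g. grasping) at 1x. It is illustrative only, not the actual recipe.

```python
import numpy as np


def sport_mode_preprocess(actions: np.ndarray, is_fragile: np.ndarray, speedup: float) -> np.ndarray:
    """Resample a demonstration with a segment-wise, non-linear speedup.

    actions:    (T, action_dim) recorded action trajectory.
    is_fragile: (T,) boolean mask marking phases that must stay at 1x (e.g. grasping).
    speedup:    factor applied to the non-fragile portions of the trajectory.
    """
    # Each original frame advances "output time" by one step if fragile,
    # or by 1/speedup steps if it can be played back faster.
    increments = np.where(is_fragile, 1.0, 1.0 / speedup)
    output_time = np.concatenate([[0.0], np.cumsum(increments)[:-1]])
    # Resample each action dimension on a unit grid of output time: sped-up stretches
    # yield fewer frames, fragile stretches keep their original timing.
    new_times = np.arange(0.0, output_time[-1] + 1e-9)
    return np.stack(
        [np.interp(new_times, output_time, actions[:, d]) for d in range(actions.shape[1])],
        axis=1,
    )
```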
We demonstrate this technique below by evaluating three policies trained on the same data on a machine feeding task. The first policy runs at a 1x rate, the second has a naive 2.6x execution-side speedup applied, and the third shows the effect of the same speedup applied on the data side with sport-mode data preprocessing.
Naive speedup increases motion speed but harms success rate, resulting in a suboptimal throughput. Data-side preprocessing speedup preserves motion speed and has much lower success rate impact, resulting in better throughput overall.
We evaluate our machine feeding models along three dimensions: throughput, measured by units per hour (UPH); efficient use of each arm, measured as how many transfers (one or more items moved in a single arm motion) are done per hour; Pick Success Rate (SR), measured as the rate at which a pick attempt succeeds.
One can come up with all sorts of creative recipes to make imitation-based policies better, but long term we bet on reinforcement learning as the most direct way to overcome core imitation learning limitations and reach robustness and throughput compatible with production deployment.
It is also conceptually a very simple way of polishing behaviors: applying RL does not require explicit reasoning about distribution coverage or specific failure modes. If the starting policy covers the desired behaviors well enough, only consistent reward assignment matters. Finally, reinforcement learning is a natural way to greatly benefit from monitored deployments: one can continuously improve policies as the deployment fleet operates, treating fleet operator interventions as negative rewards.
We will share more details on our experience with applying reinforcement learning to VLA training in a separate blog post.
System 2
System 2 is the robot-level executive layer of KinetIQ. Its role is to coordinate the actions of a single robot in pursuit of tasks assigned by System 3, operating over discrete symbolic state rather than continuous control.
While System 1 produces low-level, reactive behavior grounded in perception, System 2 is responsible for intent, sequencing, and adaptation.
System 2 treats the robot’s capabilities as a set of tools: navigation, manipulation policies, perception queries, recovery behaviors, and internal services.
Each tool exposes preconditions and outcomes, and System 2 reasons over these to decide what the robot should attempt next, in what order, and under which conditions.
SYSTEM 2
Tool-based agentic control
The core design principle of System 2 is that robot capabilities are treated as callable programs. Even internal operations such as re-localization, resetting hardware, or switching control modes are exposed as tools.
System 2 composes these tools into short-horizon plans that are executed online and revised as execution unfolds.
For example, a task such as “pick an object from the shelf and place it on the table” may be decomposed into navigation, perception, manipulation, and verification steps, with conditional branching based on observed outcomes.
This tool-based abstraction allows System 2 to remain agnostic to how individual capabilities are implemented. Classical planners, learned policies, and hand-engineered recovery behaviors can all coexist behind the same interface.
From the perspective of System 2, they are simply operators with known semantics and failure modes.
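To illustrate the idea of capabilities as callable programs with explicit preconditions and outcomes, here is a deliberately simplified sketch; the tool names, state keys, and outcome labels are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    """A robot capability exposed to System 2 as a callable program."""
    name: str
    preconditions: Callable[[dict], bool]   # checked against symbolic robot/world state
    run: Callable[[dict], str]              # returns an outcome label, e.g. "success"


def navigate(state: dict) -> str:
    return "success"


def pick(state: dict) -> str:
    return "success" if state.get("object_visible") else "object_not_found"


TOOLS = [
    Tool("navigate_to_shelf", preconditions=lambda s: s.get("localized", False), run=navigate),
    Tool("pick_object", preconditions=lambda s: s.get("at_shelf", False), run=pick),
]


def step(state: dict) -> str | None:
    """One executive step: attempt the first applicable tool and report its outcome."""
    for tool in TOOLS:
        if tool.preconditions(state):
            return tool.run(state)
    return None  # nothing applicable: escalate to System 3 or a human
```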
Our layered approach allows us to leverage state-of-the-art frontier models, which excel at tool use, for System 2.
SYSTEM 2
Progress monitoring and failure handling
System 2 is explicitly designed around the assumption that execution will sometimes fail. Policies stall, objects move, and hardware occasionally behaves unpredictably. Instead of treating failure as an exception, System 2 treats it as part of the normal decision space, adapting to ensure steady progress toward higher-level goals.
Every tool invocation is monitored via progress signals, including value predictions from System 1 policies and status reports from navigation or perception systems. Based on these signals, System 2 can decide to continue, retry, switch strategy, or escalate to System 3.
From an operational perspective, this is what makes single-robot autonomy viable in real environments. The robot does not need perfect models or perfect policies. It only needs a sufficiently rich toolbox and a control layer that can adapt when reality diverges from expectation.
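The decision logic around these progress signals can be summarized in a small sketch; the interfaces and outcome labels below are placeholders rather than our production logic.

```python
def supervise(tool, state: dict, progress, max_retries: int = 2) -> str:
    """Run a tool under progress monitoring and decide what to do next.

    `tool` exposes run(state) -> outcome label; `progress` wraps signals such as
    System 1 value predictions (both are placeholder objects for illustration).
    """
    for _ in range(max_retries + 1):
        outcome = tool.run(state)
        if outcome == "success":
            return "done"
        if progress.time_to_completion_plateaued():
            # The policy is stuck rather than merely slow: try a different strategy.
            return "switch_strategy"
        # Otherwise the failure looks transient: retry the same tool.
    return "escalate_to_system3"
```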
SYSTEM 2
Human-in-the-loop execution
System 2 is also the primary interface through which humans interact with individual robots. Rather than treating human involvement as a separate control mode, System 2 integrates human input directly into the same decision framework used for autonomous execution.
It’s just a matter of who is using the tools.
In practice, this means humans can steer robot behavior through intent: operators intervene via natural language with new instructions, refinements, hints, or corrections, without needing to fully teleoperate. System 2 integrates these updates into its context and re-plans accordingly.
These interventions are captured to feed the data flywheel, enabling our systems to recover automatically in the future.
SYSTEM 2
Discrete vs continuous control
A key architectural choice in KinetIQ is that System 2 operates purely over discrete symbolic state. Continuous control, perception, and low-level planning are deliberately pushed into Systems 1 and 0. This also means System 2 is much more robust to network latency and jitter, enabling the use of larger and slower frontier models.
This abstraction boundary gives us three important properties.
First, it allows System 2 to generalize across embodiments: the same executive logic can run on different robots as long as they expose compatible tools.
Second, it makes behavior interpretable and debuggable, since decisions are made in a human-readable task space.
Third, it enables rapid capability expansion: new policies or controllers become immediately usable by System 2 without modifying its internal logic.
System 3
System 3 is the fleet-level orchestration layer of KinetIQ. Its role is to coordinate many different robots, tasks, and humans into coherent long-horizon behavior. While System 2 turns a single robot into an adaptive agent, System 3 turns a collection of agents into a complete system.
System 3 operates at the level of work: orders, jobs, objects, locations, and human interventions. It does not reason about trajectories, grasps, or perception directly. Instead, it reasons about assignments, dependencies, progress, and outcomes.
Work is broken up into tasks, which are then assigned to specific robots.
This separation allows the same System 3 layer to coordinate many different robots with diverse capabilities, evolve workflows over time, and remain robust to hardware variability and partial observability in the physical world.
SYSTEM 3
Workflow-centric execution
The core abstraction in System 3 is the workflow. A workflow represents a structured but dynamic description of how work should progress through the system, including which operations are possible, what dependencies exist between them, and which agents or resources can perform them.
Unlike traditional automation pipelines or static task graphs, workflows in KinetIQ are not fixed programs or scripts. They are living objects that can evolve at runtime as new information arrives: robots fail or recover, tasks take longer than expected, humans intervene, and priorities shift. System 3 continuously reshapes execution by enabling or disabling transitions in the workflow based on the current global state.
Conceptually, our workflows are closest to Petri-net-style systems with memory and policies [10]. A Petri net consists of places, which represent system states; transitions, which represent tasks or operations that may occur when some conditions are met; and tokens, which mark the current state of execution.
Tokens represent discrete entities such as robots, jobs, or objects which flow through the net as work progresses.
Together, these elements provide a simple model for reasoning about concurrency, coordination, and resource flow. This design lets us operate in the regime that real deployments require: partially observable, non-deterministic, and dynamic environments.
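For readers less familiar with the formalism, a toy Petri-net step might look like the following: a transition fires only when its input places hold the required tokens, consuming and producing tokens as work progresses. This is a bare-bones illustration, not our workflow engine.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Transition:
    """A task that may fire when its input places hold enough tokens."""
    name: str
    consumes: Counter   # tokens required from input places
    produces: Counter   # tokens deposited into output places


def enabled(marking: Counter, transition: Transition) -> bool:
    return all(marking[place] >= count for place, count in transition.consumes.items())


def fire(marking: Counter, transition: Transition) -> Counter:
    return marking - transition.consumes + transition.produces


# Toy workflow: an idle robot and a pending job are combined into a pick task.
marking = Counter({"robot_idle": 1, "job_pending": 2})
assign_pick = Transition(
    "assign_pick",
    consumes=Counter({"robot_idle": 1, "job_pending": 1}),
    produces=Counter({"pick_in_progress": 1}),
)

if enabled(marking, assign_pick):
    marking = fire(marking, assign_pick)
# marking is now {"job_pending": 1, "pick_in_progress": 1}
```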
SYSTEM 3
Assignment and scheduling
At runtime, System 3 solves a continuous assignment problem. Rather than computing a single optimal global plan, System 3 advances work items through the workflow to maximize throughput, minimize latency, or optimize some other system-level utility function under real-world constraints.
Execution is inherently incremental. Tasks enter the system over time, robots become available or unavailable, and the environment changes in ways that cannot be perfectly predicted. System 3 therefore maintains a rolling allocation of work, continuously re-evaluating assignments as new information arrives. Decisions are soft and reversible: when conditions change, the system can re-route work, reshuffle priorities, or rebind tasks to different robots without restarting execution or invalidating global state.
This flexibility is what allows workflows to be deployed and evolved gradually rather than engineered upfront.
Instead of requiring a complete specification of all processes before deployment, System 3 can start from minimal structure—basic capabilities, simple task definitions, coarse constraints—and progressively refine orchestration through interaction with the real world. New tasks, resources, and policies can be introduced gradually and immediately participate in the same scheduling and execution logic.
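As a highly simplified sketch of the rolling assignment idea, the snippet below re-evaluates unassigned work items against available robots using a placeholder utility function on every cycle; the data model and scoring are invented for illustration, not the actual scheduler.

```python
from dataclasses import dataclass


@dataclass
class Robot:
    name: str
    available: bool = True


@dataclass
class WorkItem:
    name: str
    assigned_to: str | None = None


def utility(robot: Robot, item: WorkItem) -> float:
    """Placeholder system-level utility, e.g. expected throughput contribution."""
    return 1.0


def reassign(robots: list[Robot], items: list[WorkItem]) -> None:
    """One re-evaluation cycle: greedily bind unassigned items to available robots.

    Assignments are soft; running this again after conditions change can
    re-route work without restarting execution.
    """
    for item in items:
        if item.assigned_to is not None:
            continue
        candidates = [r for r in robots if r.available]
        if not candidates:
            return
        best = max(candidates, key=lambda r: utility(r, item))
        item.assigned_to = best.name
        best.available = False
```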
SYSTEM 3
Integrating with external systems
In real deployments, System 3 does not operate in isolation. It sits within a broader ecosystem of external systems—enterprise software, operational dashboards, scheduling tools, inventory systems, and human-facing control interfaces. These systems continuously inject new information, constraints, and objectives into the orchestration layer as business conditions evolve, forcing plans to adapt in near-real time.
System 3 is designed to treat all such external inputs as first-class transitions in the workflow. This includes events such as:
- new jobs arriving from upstream business systems,
- priority changes from operational control planes,
- task cancellations or reassignments,
- manual overrides from fleet operators,
- and system-level interventions such as pausing, resuming, or reconfiguring execution.
All of these inputs are integrated into the same orchestration logic as autonomous decisions. They update the workflow state, trigger re-evaluation of assignments, and reshape execution in a controlled and consistent way.
This design has two important consequences:
First, System 3 becomes a stable integration layer between robotics and the broader enterprise stack, without hard-coding assumptions about specific robots, facilities, or workflows. This allows System 3 to handle exceptions through direct interaction with facility management systems—requesting real-time guidance when unexpected or ambiguous situations arise, rather than encoding brittle decision logic in advance.
Second, System 3 establishes a bidirectional feedback loop with external systems. All inbound interactions—overrides, adjustments, re-prioritizations—are captured as structured execution data. Over time, this allows System 3 to learn how operational systems actually shape behavior: which interventions are common, where friction occurs, and how global workflows should evolve to better align robotic execution with real business processes.
In the opposite direction, System 3 continuously reports task execution status, warnings, errors, metrics, and optimization signals back to facility management systems. This provides an accurate, real-time representation of the operational state and creates a concrete foundation for efficiency gains and longer-term business process transformation.
SYSTEM 3
Workflow inference and learning from execution
Every fleet produces rich traces of events, state transitions, assignments, failures, and external interventions, forming a structured execution history that can be analyzed to infer how work is actually performed across the system.
This class of approach is well established in large-scale operational systems, where similar techniques are used to recover implicit process structure, identify bottlenecks, and refine coordination logic based on real execution data.
With KinetIQ, these traces provide the foundation for evolving orchestration over time. Rather than relying exclusively on hand-specified workflows and static fleet logic, System 3 is designed to progressively incorporate execution data into how workflows are represented, scheduled, and optimized, shifting from manually engineered orchestration toward more adaptive, data-driven coordination.
Conclusion
We have provided a high-level overview of KinetIQ, Humanoid’s cognitive AI architecture. With this stack, our goal is to build a true capability factory: a scalable, cost-efficient way to rapidly create and harden complex fleet-level behaviors that meet strict deployment standards. KinetIQ is a fully agentic system designed for a domain where there is no well-defined playbook—one that requires continuous experimentation, integration, and refinement as we push toward new classes of capability. We believe this approach is a key factor behind Humanoid’s rapid progress toward solving Physical AI.
Much work still lies ahead before the KinetIQ vision is fully realized. Fully bringing it to life is not just a robotics or deep learning challenge: building general-purpose Physical AI that can be deployed at scale requires sustained progress across perception, learning, controls, data, and systems, as well as the engineering needed to make these advances reliable in the real world.
At the same time, robotics is entering an unusually exciting phase, where new capabilities are unlocked at a rapid pace, and overall AI progress makes things that were hard yesterday trivial today.
If the possibilities ahead excite you, come build the future with us!
References
[1] Li, Jialong, et al. "AMO: Adaptive Motion Optimization for Hyper-Dexterous Humanoid Whole-Body Control." arXiv:2505.03738.
[2] Liao, Qiayuan, et al. "BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion." arXiv:2508.08241.
[3] Bjelonic, Filip, Fabian Tischhauser, and Marco Hutter. "Towards Bridging the Gap: Systematic Sim-to-Real Transfer for Diverse Legged Robots." arXiv:2509.06342.
[4] Chi, Cheng, et al. "Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots." arXiv:2402.10329.
[5] Pai, Jonas, et al. "mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs." arXiv:2512.15692.
[6] Kim, Moo Jin, et al. "Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning." arXiv:2601.16163.
[7] Zhang, Tonghe, Chao Yu, Sichang Su, and Yu Wang. "ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning." arXiv:2505.22094.
[8] Black, Kevin, Manuel Y. Galliker, and Sergey Levine. "Real-Time Execution of Action Chunking Flow Policies." arXiv:2506.07339.
[9] Black, Kevin, Allen Z. Ren, Michael Equi, and Sergey Levine. "Training-Time Action Conditioning for Efficient Real-Time Chunking." arXiv:2512.05964.