Perplexity Launches Hybrid AI Workload Management System

Perplexity has introduced an AI system, set to launch in July, that automatically balances workloads between local devices and cloud models.

Summary

Perplexity unveiled its "hybrid agentic inference" system at Computex 2026, designed to automatically distribute AI tasks between local devices and cloud-based models without user intervention.
The feature will be available in Perplexity Computer starting in July. It was demonstrated on Intel Core Ultra Series 3 processors and is currently exclusive to the Windows PC application.
CEO Aravind Srinivas emphasized the importance of cost efficiency, noting that the company’s revenue surged to $500 million while the workforce grew by only 34%; delegating inference to user hardware helps maintain this balance.

During the Computex 2026 event in Taipei on June 2, Perplexity CEO Aravind Srinivas, alongside Intel CEO Lip-Bu Tan, announced what they claim is the first hybrid local-server inference orchestrator. This system, set to debut in Perplexity Computer in July, automatically determines which aspects of an AI task should be executed on a user’s device and which should be processed by more advanced cloud models, all without requiring user input.

“Today we're announcing the next evolution for Personal Computer: the first hybrid local-server inference orchestrator,” stated Perplexity. “It intelligently decides which tasks run on local devices and which are sent to cloud agents, directing each segment of a task to the appropriate location.”

According to Perplexity, an AI system should ideally maximize token value per watt for each user. Three competing pressures complicate this: accuracy calls for the most powerful models, privacy requires that certain data remain on the local device, and cost argues against using a high-end model for simple tasks.

The solution, termed "hybrid agentic inference," simultaneously tackles these challenges. A lightweight model operates on the user's device, acting as a traffic director that decides which data is sensitive and should remain local, and which tasks require the capabilities of a cloud-based frontier model.

“Hybrid agentic inference is ideal for tasks that involve sensitive information but still need robust AI capabilities, such as financial documents, health records, and personal files,” the company elaborated. “The local model determines when sensitive data must stay on your device, while more demanding tasks are processed on the server.”

Why is this important?

Inference, the act of executing a trained AI model to produce a response, occurs whenever a user interacts with a chatbot. Currently, this process predominantly takes place on remote servers managed by AI companies, which means that sensitive information like financial documents and health inquiries is sent to external computers before the user receives a response.

Running every query on remote servers is also expensive, which is why many AI systems feature “Auto” or “low thinking” modes: companies often push users toward the least expensive processing options available.

Srinivas candidly addressed this issue in an interview with Bloomberg Television at Computex, stating, "You don't want all your computing centralized in servers and relying on the largest models. Some organizations are spending up to half a billion dollars monthly. What you really need is effective value per watt per user." By shifting inference tasks to user devices, Perplexity can lower operational costs.

Local inference benefits companies by significantly reducing expenses, while also providing a crucial advantage for users: it keeps their data on their machines. The traditional compromise has been performance—local models tend to be less powerful than those hosted in data centers.

Perplexity's orchestrator aims to achieve the best of both worlds. Routine tasks, such as summarizing previously written documents, formatting text, and lightweight classification, are handled locally, while more complex reasoning is directed to the cloud, ideally without including sensitive data. The company claims this happens automatically and seamlessly mid-task, though how well it works can only be judged once the system launches in July.
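Perplexity has not published how its orchestrator makes these decisions, so the following is only a minimal sketch of the general idea, written in Python. Everything in it is an assumption for illustration: the task labels, the `route` and `redact` helpers, and the regular expressions standing in for the on-device model that would judge sensitivity. It shows routine task types staying local and anything else going to the cloud with sensitive-looking strings stripped first.

```python
import re
from dataclasses import dataclass

# Hypothetical task labels; the local ones mirror the routine tasks
# named in the article (summarizing, formatting, light classification).
LOCAL_TASKS = {"summarize", "format", "classify"}

# Illustrative stand-ins for a sensitivity check. A real system would
# rely on the on-device model rather than fixed patterns.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # shaped like a US Social Security number
    re.compile(r"\b\d{12,19}\b"),          # long digit run, e.g. an account number
]


@dataclass
class Decision:
    target: str   # "local" or "cloud"
    payload: str  # text that would be processed at that target


def redact(text: str) -> str:
    """Replace sensitive-looking substrings before text leaves the device."""
    for pattern in SENSITIVE_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def route(task_kind: str, text: str) -> Decision:
    """Keep routine tasks on the device; send the rest to the cloud, redacted."""
    if task_kind in LOCAL_TASKS:
        return Decision("local", text)
    return Decision("cloud", redact(text))


if __name__ == "__main__":
    print(route("summarize", "My SSN is 123-45-6789"))
    print(route("plan", "My SSN is 123-45-6789. Plan my taxes."))
```

In this sketch the routing rule is a fixed lookup, whereas the system described above is said to decide dynamically for each segment of a task. The point is only the shape of the trade-off: the local path sees the full text, and the cloud path sees a reduced version of it.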

It is important to clarify that this does not involve Perplexity offering a fully open-source local model under user control. The local element is a compact model integrated within Perplexity's application, while the cloud component continues to operate through Perplexity's servers. Users seeking a completely offline and self-hosted solution—similar to what projects like MiniCPM5-1B provide—will not find that option here.

For context, Perplexity's revenue has grown from $100 million to $500 million with only a 34% increase in headcount, as Srinivas announced in April. A company that routes queries across models it does not train has a strong incentive to minimize computing expenditures. Partially shifting the inference load to users’ devices, billions of which are already in use, is a practical way to do so. The privacy benefits are genuine, but they align conveniently with the financial ones.

Who else is pursuing this strategy?

Many leading AI companies are currently advancing towards on-device or hybrid inference. Apple’s AI processes its most sensitive tasks locally using M-series chips. Microsoft’s Foundry Local became generally available in April 2026, allowing full AI inference on Windows, macOS, and Linux without relying on the cloud.

Nvidia also announced RTX Spark, which focuses on local LLM inference for laptops and desktops, at the same Computex event where Perplexity revealed its system. Google’s approach has drawn criticism; as Decrypt reported, Chrome secretly installed a 4GB Gemini Nano model without user consent, and the "AI Mode" button commonly seen by users does not utilize it at all.

What sets Perplexity apart is its orchestration layer. Instead of requiring users to choose between local or cloud processing at the outset, the system makes this decision dynamically for each task. Srinivas said the method is "chip agnostic": the Computex demonstration used Intel Core Ultra Series 3 processors, but Nvidia processors are also compatible. Currently, this feature is exclusive to the Perplexity application for Windows PCs, with no confirmed timeline for wider release.
