
During WWDC25, Apple announced new versions of its on-device and cloud-based foundation models. Now, the company has published a tech report detailing how those models were trained, optimized, and evaluated. And the report includes some genuinely interesting under-the-hood tidbits.
In a comprehensive document called “Apple Intelligence Foundation Language Models – Tech Report 2025”, the company walks through multiple aspects of the new models, including their architecture, data sources, pre-training, post-training, tool use development, optimizations, and benchmarks.

It is a very technical but very worthwhile read if you like to get into the nuts and bolts of this sort of stuff. Here are a few particularly interesting highlights.
The local model was split into two blocks
We already knew that Apple’s on-device model (the one developers will get to tap into) has around 3 billion parameters. Now, the company has detailed that this model is actually divided into two blocks:
“Block 1 contains 62.5% of the total transformer layers, while Block 2 contains the remaining 37.5% of the transformer layers, but had the key and value projections removed.”
In practice, this means that the local model requires 37.5% less memory for its key-value (KV) cache, and the time it takes to output the first token (basically, a fragment of a word) was also cut by about 37.5%. Still, Apple structured the split in a way that it says preserves the model’s overall performance and output quality.
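To make the idea more concrete, here is a minimal sketch of what removing the key and value projections from the second block could look like. This is my own illustration, not Apple’s code: the layer sizes are made up, and the class names (Block1Attention, Block2Attention) are hypothetical. The point is simply that the later attention layers keep only a query projection and read keys and values from the earlier block’s cache, so they need no KV cache of their own.

```python
# Illustrative sketch only: a "Block 2" attention layer with its K/V
# projections removed, reusing the keys/values produced by "Block 1".
import torch
import torch.nn as nn

D_MODEL = 64  # made-up size, far smaller than the real ~3B-parameter model

class Block1Attention(nn.Module):
    """Standard self-attention: owns its own Q, K, and V projections."""
    def __init__(self):
        super().__init__()
        self.q = nn.Linear(D_MODEL, D_MODEL)
        self.k = nn.Linear(D_MODEL, D_MODEL)
        self.v = nn.Linear(D_MODEL, D_MODEL)
        self.out = nn.Linear(D_MODEL, D_MODEL)

    def forward(self, x):
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / D_MODEL**0.5, dim=-1)
        # During generation, k and v would be cached; only this block produces them.
        return self.out(attn @ v), (k, v)

class Block2Attention(nn.Module):
    """Attention with K/V projections removed: reads Block 1's cached K/V."""
    def __init__(self):
        super().__init__()
        self.q = nn.Linear(D_MODEL, D_MODEL)
        self.out = nn.Linear(D_MODEL, D_MODEL)

    def forward(self, x, cached_kv):
        k, v = cached_kv  # borrowed from Block 1, so no extra cache is needed here
        q = self.q(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / D_MODEL**0.5, dim=-1)
        return self.out(attn @ v)

x = torch.randn(1, 8, D_MODEL)      # (batch, sequence, hidden)
h, kv = Block1Attention()(x)        # Block 1 computes and caches K/V
y = Block2Attention()(h, kv)        # Block 2 queries against Block 1's K/V
print(y.shape)                      # torch.Size([1, 8, 64])
```

Since the second block never stores its own keys and values, the cache it would normally occupy simply disappears, which is where the memory and time-to-first-token savings described in the report come from.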

As a side note, a few years ago, Apple published this study, which looked at swapping parts of an LLM between RAM and flash storage as needed, in order to run a local model bigger than what would otherwise fit in the device’s memory.
While Apple ultimately took a different route, it is interesting to note the different ways the company has been experimenting to offer good local performance, even on memory-constrained devices.
The cloud-based model has a creative architecture
For its server model, Apple built a custom architecture that was tailor-made for its Private Cloud Compute platform. It’s called Parallel-Track Mixture-of-Experts (PT-MoE), and the way it works is pretty neat.
In a nutshell (and at the risk of oversimplifying things), Mixture of Experts is an approach where, instead of relying on one huge monolithic AI model, the model is split into smaller subnetworks (or experts) that are only activated when the task is related to something they’re… well, an expert in.
So if your prompt is about cooking, only cooking-related experts are activated, while others remain dormant. The result is still a massive overall model, but its modular design allows it to respond faster (and often more accurately) than if every prompt had to run through one huge, unified model.
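If it helps, here is a toy sketch of the general routing idea: a generic top-k MoE layer, not Apple’s implementation. A small gating network scores the experts for each token, and only the best-scoring few actually run. All sizes here are made up, and the experts are simplified to plain linear layers.

```python
# Generic top-k Mixture-of-Experts routing, sketched in PyTorch for
# illustration only (not Apple's code; sizes are invented).
import torch
import torch.nn as nn

HIDDEN, N_EXPERTS, TOP_K = 64, 8, 2

experts = nn.ModuleList(nn.Linear(HIDDEN, HIDDEN) for _ in range(N_EXPERTS))
router = nn.Linear(HIDDEN, N_EXPERTS)  # scores how relevant each expert is

def moe_layer(token: torch.Tensor) -> torch.Tensor:
    weights, picked = router(token).softmax(-1).topk(TOP_K)
    # Only the top-k experts are evaluated; the other experts stay idle.
    return sum(w * experts[int(i)](token) for w, i in zip(weights, picked))

token = torch.randn(HIDDEN)
print(moe_layer(token).shape)  # torch.Size([64])
```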
Apple built a new kind of Transformer called the Parallel Track Transformer, then scaled it up with Mixture of Experts (MoE) layers. That sounds way too complicated, but the gist of it is:
Traditional Transformers process tokens through a single stack of layers, one after the other. But rather than using this single-track approach to calculate every token, Apple’s design splits the model into multiple, parallel tracks. Each track processes tokens independently, and only syncs up at certain points.
Then, inside each of those tracks, Apple replaced every other regular transformer layer with an MoE layer, which activates just a few experts for each token, while the rest stay idle. And because each track has its own local experts, the model avoids the processing bottlenecks that happen when everything has to coordinate across the entire system.

Add to that a clever setup that balances local context with big-picture understanding (called Interleaving Global and Local Attention Layers), and the result is a very modular, efficient, and scalable model that’s faster and leaner, but still pretty smart.
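Pulling those pieces together, here is a rough conceptual sketch of the parallel-track idea as I read it from the report. It is an illustration under stated assumptions, not Apple’s architecture code: the Track and parallel_track_block names are mine, the sizes are made up, and both the regular and MoE layers are simplified to plain linear layers. The thing to notice is that the tracks run independently and only exchange information at a sync point.

```python
# Conceptual sketch of parallel tracks with only occasional synchronization
# (illustration only, not Apple's PT-MoE implementation).
import torch
import torch.nn as nn

HIDDEN, N_TRACKS, DEPTH = 64, 3, 4  # invented sizes

class Track(nn.Module):
    """One parallel track: a stack of layers processed independently."""
    def __init__(self):
        super().__init__()
        # Each layer stands in for a transformer or local-MoE layer,
        # both simplified to a Linear for brevity.
        self.layers = nn.ModuleList(nn.Linear(HIDDEN, HIDDEN) for _ in range(DEPTH))

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

tracks = nn.ModuleList(Track() for _ in range(N_TRACKS))

def parallel_track_block(x):
    # Each track processes the tokens on its own (no cross-track traffic)...
    outputs = [track(x) for track in tracks]
    # ...and the tracks only sync up here, at the block boundary.
    return torch.stack(outputs).mean(0)

tokens = torch.randn(1, 8, HIDDEN)
print(parallel_track_block(tokens).shape)  # torch.Size([1, 8, 64])
```

Because cross-track communication only happens at those boundaries, each track can keep its own local experts busy without waiting on the rest of the system, which is the bottleneck-avoidance point the report makes.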
Apple increased multilingual representation by 275%
One of the biggest knocks against the initial rollout of Apple Intelligence was (and still is) limited language support beyond English. With its new models, Apple has expanded language support, and the document details the steps it took in order to do that.
According to the document, Apple increased the amount of multilingual data used during training from 8% to 30%. This includes both organic and synthetic content.
Apple also expanded its tokenizer’s vocabulary (which is basically the set of tokens the model knows) by 50%. This means that the model now knows 150K different tokens, up from the previous 100K.
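Apple’s tokenizer isn’t public, so here is a rough illustration of why a bigger vocabulary matters for multilingual text using OpenAI’s open-source tiktoken encodings instead (one with a ~100K vocabulary, one with ~200K). It is only an analogy: a larger vocabulary generally covers non-English text with fewer, longer tokens, which makes generation cheaper and gives the model better-shaped units to learn from.

```python
# Illustration using tiktoken, not Apple's tokenizer: a larger vocabulary
# tends to represent non-English text with fewer tokens.
import tiktoken

text = "こんにちは、世界。今日はいい天気ですね。"  # a short Japanese sample

small_vocab = tiktoken.get_encoding("cl100k_base")  # ~100K-token vocabulary
large_vocab = tiktoken.get_encoding("o200k_base")   # ~200K-token vocabulary

print(len(small_vocab.encode(text)))  # typically more tokens with the smaller vocab
print(len(large_vocab.encode(text)))  # typically fewer tokens with the larger one
```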
The company says that these changes led to “significant gains” in performance across non-English benchmarks, especially after reinforcement learning fine-tuning.
In the document, Apple explains that evaluations were conducted using prompts written by native speakers (rather than translations), and the model was tested on both accuracy and how natural its responses sounded in local contexts. If this sounds familiar, you probably read our recent coverage of this Apple Research study.
In practice, all of this means that features like Writing Tools should work more reliably in the supported languages.

From where did Apple source its data?
Like with its first models, most of the training data came from crawling the web. But Apple says that its Applebot crawler respects robots.txt exclusions, meaning that if a website doesn’t want Apple to scrape its content, it can say so, and Applebot will leave it alone.
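In practice, that opt-out is just a standard robots.txt rule. A minimal example of blocking Apple’s crawler from an entire site (this follows the usual robots.txt convention; it isn’t taken from the report itself) looks like this:

```
User-agent: Applebot
Disallow: /
```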
That said, here is how Apple says it sourced the data for its new models:
- Publicly available web data: Although Apple doesn’t specify quantities or ratios, it does say that the largest portion of its training data came from Applebot crawling web pages. Apple applied multiple layers of filtering to remove low-quality, unsafe, or irrelevant content, including spammy pages, shallow or templated text, and broken formatting.
- Licensed data: Apple doesn’t go into much detail here, but does confirm that some of the training data was licensed from publishers. Earlier reports had suggested that Apple had been negotiating with Condé Nast (The New Yorker, Vogue, Wired, etc.), NBC News, and IAC (People Magazine, The Daily Beast, and Better Homes and Gardens, etc.), so it’s likely that at least some of that material made it in.
- Synthetic data: Apple generated synthetic data using smaller models and custom pipelines, particularly for math, code, instruction tuning, and vision-language tasks. While the company also doesn’t specify how much of the dataset this represented, it notes that synthetic data played a large role in key training steps like fine-tuning, reinforcement learning, and improving multilingual support. And if you’re wondering whether synthetic data just means “made-up stuff,” we’ve got an explainer on why that’s not the case.
- Visual data: To support image understanding, Apple collected over 10 billion image–caption pairs, including screenshots with OCR and handwritten notes. It also used its own models to generate additional, richer captions. In the past, it was reported that Apple had held licensing talks with Shutterstock, so it’s possible some of that material also made it in.
9to5Mac’s take
There has been no shortage of news about Apple’s internal drama, technical struggles, and overall inability to gain the momentum it needs to bridge the gap (which some might call a chasm) between its AI offerings and the competition. All of that is true.
Yet, the fact that Apple is largely perceived as being behind on AI doesn’t mean the company is standing still. This report offers an interesting insight into the under-the-hood improvements (and shortcomings) of Apple’s newest models, along with extensive details on a privacy-conscious approach that few companies are even attempting.