Inside Meta's scramble to catch up on AI
April 25, 2023
By Katie Paul, Krystal Hu, Stephen Nellis and Anna Tong
(Reuters) - As the summer of 2022 came to a close, Meta CEO Mark
Zuckerberg gathered his top lieutenants for a five-hour dissection of
the company's computing capacity, focused on its ability to do
cutting-edge artificial intelligence work, according to a company memo
dated Sept. 20 reviewed by Reuters.
They had a thorny problem: despite high-profile investments in AI
research, the social media giant had been slow to adopt expensive
AI-friendly hardware and software systems for its main business,
hobbling its ability to keep pace with innovation at scale even as it
increasingly relied on AI to support its growth, according to the memo,
company statements and interviews with 12 people familiar with the
changes, who spoke on condition of anonymity to discuss internal company
matters.
"We have a significant gap in our tooling, workflows and processes when
it comes to developing for AI. We need to invest heavily here," said the
memo, written by new head of infrastructure Santosh Janardhan, which was
posted on Meta's internal message board in September and is being
reported now for the first time.
Supporting AI work would require Meta to "fundamentally shift our
physical infrastructure design, our software systems, and our approach
to providing a stable platform," it added.
For more than a year, Meta has been engaged in a massive project to whip
its AI infrastructure into shape. While the company has publicly
acknowledged "playing a little bit of catch-up" on AI hardware trends,
details of the overhaul - including capacity crunches, leadership
changes and a scrapped AI chip project - have not been reported
previously.
Asked about the memo and the restructuring, Meta spokesperson Jon
Carvill said the company "has a proven track record in creating and
deploying state-of-the-art infrastructure at scale combined with deep
expertise in AI research and engineering."
"We're confident in our ability to continue expanding our
infrastructure's capabilities to meet our near-term and long-term needs
as we bring new AI-powered experiences to our family of apps and
consumer products," said Carvill. He declined to comment on whether Meta
abandoned its AI chip.
Janardhan and other executives did not grant interview requests made
through the company.
The overhaul spiked Meta's capital expenditures by about $4 billion a
quarter, according to company disclosures - nearly double its spending as
of 2021 - and led it to pause or cancel previously planned data center
builds in four locations.
Those investments have coincided with a period of severe financial
squeeze for Meta, which has been laying off employees since November at
a scale not seen since the dotcom bust.
Meanwhile, Microsoft-backed OpenAI's ChatGPT surged to become the
fastest-growing consumer application in history after its Nov. 30 debut,
triggering an arms race among tech giants to release products using
so-called generative AI, which, beyond recognizing patterns in data like
other AI, creates human-like written and visual content in response to
prompts.
Generative AI gobbles up reams of computing power, amplifying the
urgency of Meta's capacity scramble, said five of the sources.
FALLING BEHIND
A key source of the trouble, those five sources said, can be traced back
to Meta's belated embrace of the graphics processing unit, or GPU, for
AI work.
GPU chips are uniquely well-suited to artificial intelligence processing
because they can perform large numbers of tasks simultaneously, reducing
the time needed to churn through billions of pieces of data.
However, GPUs are also more expensive than other chips, with chipmaker
Nvidia Corp controlling 80% of the market and maintaining a commanding
lead on accompanying software, the sources said.
Nvidia did not respond to a request for comment for this story.
Instead, until last year, Meta largely ran AI workloads on its fleet of
commodity central processing units (CPUs), the workhorse chip of the
computing world, which has filled data centers for decades but performs
AI work poorly.
According to two of those sources, the company also started using a
custom chip it had designed in-house for inference, an AI process in
which algorithms trained on huge amounts of data make judgments and
generate responses to prompts.
By 2021, that two-pronged approach proved slower and less efficient than
one built around GPUs, which were also more flexible in running
different types of models than Meta's chip, the two people said.
Meta declined comment on its AI chip's performance.
As Zuckerberg pivoted the company toward the metaverse - a set of
digital worlds enabled by augmented and virtual reality - its capacity
crunch was slowing its ability to deploy AI to respond to threats, like
the rise of social media rival TikTok and Apple-led ad privacy changes,
said four of the sources.
A security guard stands watch by the
Meta sign outside the headquarters of Facebook parent company Meta
Platforms Inc in Mountain View, California, U.S. November 9, 2022.
REUTERS/Peter DaSilva/File Photo
The stumbles caught the attention of former Meta board member Peter
Thiel, who resigned in early 2022 without explanation.
At a board meeting before he left, Thiel told Zuckerberg and his
executives they were complacent about Meta's core social media
business while focusing too much on the metaverse, which he said
left the company vulnerable to the challenge from TikTok, according
to two sources familiar with the exchange.
Meta declined to comment on the conversation.
CATCH-UP
After pulling the plug on a large-scale rollout of Meta's own custom
inference chip, which was planned for 2022, executives reversed
course and placed orders that year for billions of dollars' worth of
Nvidia GPUs, one source said.
Meta declined to comment on the order.
By then, Meta was already several steps behind peers like Google,
which had begun deploying its own custom-built AI chips, known as
TPUs (tensor processing units), in 2015.
That spring, executives also set about reorganizing Meta's AI units,
naming two new heads of engineering in the process, including
Janardhan, the author of the September memo.
More than a dozen executives left Meta during the months-long
upheaval, according to their LinkedIn profiles and a source familiar
with the departures, a near-wholesale change of AI infrastructure
leadership.
Meta next started retooling its data centers to accommodate the
incoming GPUs, which draw more power and produce more heat than
CPUs, and which must be clustered closely together with specialized
networking between them.
The facilities needed 24 to 32 times the networking capacity and new
liquid cooling systems to manage the clusters' heat, requiring them
to be "entirely redesigned," according to Janardhan's memo and four
sources familiar with the project, details of which have not
previously been disclosed.
As the work got underway, Meta made internal plans to start
developing a new and more ambitious in-house chip, which, like a GPU,
would be capable of both training AI models and performing
inference. The project, which has not been reported previously, is
set to finish around 2025, two sources said.
Carvill, the Meta spokesperson, said data center construction that
was paused while transitioning to the new designs would resume later
this year. He declined to comment on the chip project.
TRADE-OFFS
While scaling up its GPU capacity, Meta, for now, has had little to
show as competitors like Microsoft and Google promote public
launches of commercial generative AI products.
Chief Financial Officer Susan Li acknowledged in February that Meta
was not devoting much of its current compute to generative work,
saying "basically all of our AI capacity is going towards ads, feeds
and Reels," its TikTok-like short video format that is popular with
younger users.
According to four of the sources, Meta did not prioritize building
generative AI products until after the launch of ChatGPT in
November. Even though its research lab FAIR, or Facebook AI
Research, has been publishing prototypes of the technology since
late 2021, the company was not focused on converting its
well-regarded research into products, they said.
As investor interest soars, that is changing. Zuckerberg announced a
new top-level generative AI team in February that he said would "turbocharge"
the company's work in the area.
Chief Technology Officer Andrew Bosworth likewise said this month
that generative AI was the area where he and Zuckerberg were
spending the most time, forecasting Meta would release a product
this year.
Two people familiar with the new team said its work was in the early
stages and focused on building a foundation model, a core program
that can later be fine-tuned and adapted for different products.
Carvill, the Meta spokesperson, said the company has been building
generative AI products across different teams for more than a year. He
confirmed that the work has accelerated in the months since
ChatGPT's arrival.
(Reporting by Katie Paul, Krystal Hu, Stephen Nellis and Anna Tong;
additional reporting by Jeffrey Dastin; editing by Kenneth Li and
Claudia Parsons)
© 2023 Thomson Reuters. All rights reserved.
This material may not be published, broadcast, rewritten or redistributed.
Thomson Reuters is solely responsible for this content.