📁: Meta got caught misleading AI benchmarks

Meta's new flagship AI model, Maverick, ranks #2 on LM Arena—an evaluation using human raters. However, the version tested may differ from the one available to developers.

Hey everyone,

Just a quick heads-up: Meta’s benchmark results for its new Llama 4 model may be misleading. Also, Google is paying its AI staff to do nothing 😂 .

What’s inside today’s newsletter:

  • 🌐 Tech Pulse: Google is allegedly paying some AI staff to do nothing for a year rather than join rivals.

  • 📁 Unfold AI Tricks: Turn Meeting Notes into Action Items with Otter.ai.

  • 🔧 6 New Tools on the Block.

  • 🔎 Spotlight: Meta got caught misleading AI benchmarks.

  • To-dos.

🌐 Tech Pulse

Google is allegedly paying some of its AI staff to do nothing for up to a year rather than let them join rivals, reportedly using noncompete-style agreements to keep top talent on the bench during that period.

📁 Unfold AI Tricks: Turn Meeting Notes into Action Items with Otter.ai

Overview: Drowning in meeting notes and struggling to keep track of key takeaways? Otter.ai automatically transcribes your meetings and highlights important action items—saving you time and keeping your team aligned.

Duration: 5-10 minutes

Skill Level: Beginner

Steps:

  1. Create a free Otter.ai account
    Head to otter.ai and sign up using your email or Google account.

  2. Upload or record a meeting
    Click on "Import" to upload an existing audio/video file, or use Otter to record a live meeting via Zoom, Google Meet, or your browser.

  3. Let Otter transcribe it
    Otter will automatically transcribe the audio into text, complete with speaker identification and timestamps.

  4. Use the Summary feature
    Once the transcription is complete, click on “Summary” to instantly get key points and action items extracted from the conversation.

  5. Copy & share
    Copy the summary or export it to share with your team, or paste it directly into tools like Notion, Slack, or Trello (see the automation sketch after the Pro Tip below).

Pro Tip: Tag key team members during live transcriptions (e.g., “@Alex will handle the design update”) to make action items clearer and easier to assign later.
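Bonus: if you'd rather script step 5 than copy-paste by hand, here's a minimal Python sketch that posts an exported summary to a Slack channel. It assumes you've exported the Otter summary as a plain-text file and set up a Slack incoming webhook; the webhook URL and filename below are placeholders, not real values.

    # Hypothetical automation of step 5: push an exported Otter.ai summary to Slack.
    # Assumes a Slack incoming webhook (https://api.slack.com/messaging/webhooks)
    # and a summary exported from Otter as plain text. Placeholders throughout.
    import json
    import urllib.request

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

    def post_summary_to_slack(summary_path: str) -> None:
        # Read the exported summary and send it as a single Slack message.
        with open(summary_path, encoding="utf-8") as f:
            summary = f.read().strip()
        payload = json.dumps({"text": "*Meeting summary (via Otter.ai)*\n" + summary})
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=payload.encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            print("Slack responded with HTTP", resp.status)

    post_summary_to_slack("meeting_summary.txt")  # hypothetical exported file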

💼 From our Partners

You’ve heard the hype. It’s time for results.

After two years of siloed experiments, proofs of concept that fail to scale, and disappointing ROI, most enterprises are stuck. AI isn't transforming their organizations — it’s adding complexity, friction, and frustration.

But Writer customers are seeing positive impact across their companies. Our end-to-end approach is delivering adoption and ROI at scale. Now, we’re applying that same platform and technology to build agentic AI that actually works for every enterprise.

This isn’t just another hype train that overpromises and underdelivers.
It’s the AI you’ve been waiting for — and it’s going to change the way enterprises operate. Be among the first to see end-to-end agentic AI in action. Join us for a live product release on April 10 at 2pm ET (11am PT).

Can't make it live? No worries — register anyway and we'll send you the recording!

🔧 New Tools on the Block

  1. Flexprice: Usage-based pricing and metering for developers.

  2. Databutton MCP: Give your app AI superpowers with MCPs.

  3. PyUI Builder: Build Python GUIs like Canva.

  4. Scout: Instant Marketplace Alerts.

  5. Amazon Buy for Me: AI agent that shops on other sites for you.

  6. Thumb Zone AI: Emotion AI for user testing & product experience.


    Want to sponsor your tool in our newsletter? CLICK HERE

    *sponsored

🔎 Spotlight: Meta got caught misleading AI benchmarks

Summary: Meta seemingly manipulated AI benchmarks for its new Llama 4 model, Maverick, to create the impression of superior performance. By submitting an "experimental" version optimized for conversational ability to the human-preference-based LM Arena benchmark, Meta achieved a high ranking that didn't reflect the capabilities of the publicly available model. This discrepancy sparked criticism from the AI research community and from LM Arena itself, leading to policy updates aimed at preventing such misleading evaluations in the future. The incident highlights the increasingly competitive landscape of AI development and the pressure on companies to demonstrate leadership, even through questionable tactics that undermine the reliability of benchmarks.

Here’s why it matters:

  • Meta submitted an "experimental chat version" of Maverick to LM Arena, not the publicly available model, making the benchmark results potentially misleading.

  • This action raises concerns about "gaming" AI benchmark systems, which are intended to provide fair and reproducible evaluations of model performance.

  • The discrepancy between the benchmarked model and the public version creates confusion for developers who rely on such rankings to guide their choice of AI models for applications.

😂 Tech Memes

To-dos

Do you have a topic in mind? We’re always open to suggestions—let us know what you want to learn next!

How was today’s newsletter?

If this issue were a WSJ article, how would you rate it?


If you liked today’s issue, consider subscribing.

That’s a wrap! Catch you in tomorrow’s edition. 👋

—Harman

P.S. If you haven’t filled out the subscriber survey yet, take a minute to do it. It helps me tailor this newsletter to what you care about.
