OpenAI HealthBench and Codex | Weekly edition

PLUS HOT AI Tools & Tutorials

May 16, 2025

Hey there! Welcome to your weekly AI catch-up.

Big moves this week: OpenAI’s new HealthBench is crushing it, beating Gemini and Claude in real doctor tests. Google’s making Gemini unavoidable, bringing it to your watch, TV, car, and XR headset. Trump’s policy changes sparked a $600B Saudi AI gold rush, and Google’s AlphaEvolve just solved historic math problems (yep, with more AI).

On the tools side, fresh coding assistants and workflow agents are dropping nonstop. And trust me—the meme of the week? Absolute gold.

Let’s dive in!

This Creators’ AI Edition:

Featured Materials 🎟️
News of the week 🌍
Useful tools ⚒️
Weekly Guides 📕
AI Meme of the Week 🤡
AI Tweet of the Week 🐦
(Bonus) Materials 🎁

Find out where your customers are looking

Optimize designs pre-launch with AI-driven insights. Attention Insight gives you the power to see design through the eyes of your customers. Reduce friction, gain stakeholder confidence, and get your designs approved faster than ever.

20% Off use CREATORSAI20

Featured Materials 🎟️

Introducing HealthBench | An evaluation for AI systems and human health

OpenAI has just introduced HealthBench, a brand-new benchmark for evaluating AI in healthcare. But it’s not just a simple test — it’s a full system with over 5,000 realistic medical dialogues, created with input from 262 doctors across 60 countries. Each AI response is judged using a detailed rubric crafted by real medical professionals to see how well the AI can deliver useful, safe, and accurate recommendations in real-world situations.

The latest results show that the o3 model outperforms others, including Claude 3.7 Sonnet and Gemini 2.5 Pro (as of March 2025). In recent months, frontier models have improved their HealthBench scores by 28%—a major leap in both safety and performance. This improvement is even greater than the jump observed between GPT-3.5 Turbo and GPT-4o (August 2024).

Better health models could make the biggest difference in places with fewer resources—but only if people can use them. That’s why it’s important to look at both cost and performance when talking about scaling up models.

If you look at a scatter plot of cost vs. score, you’ll see most points form a diagonal—higher cost usually means a better score. But things are changing fast! OpenAI’s April 2025 models (o3, o4-mini, GPT‑4.1) set a new standard for getting great performance without sky-high costs. Small models, in particular, have made big strides lately: GPT‑4.1 nano actually outperforms last year’s GPT‑4o, while being 25 times cheaper to use. And when you compare o3, o4-mini, and o1 models at different levels of reasoning power, you see that more computing at test time leads to even better results. It looks like new reasoning-focused models will keep pushing these boundaries even further in the coming months.

Turns out, HealthBench’s grading lines up really closely with how real doctors grade, which means it does a solid job reflecting expert opinions.

To double-check how good our model-based grading is, OpenAI had physicians review answers in HealthBench Consensus to see if they met the rubric’s standards. Basically, we created “meta-evaluations”—these are checks to see if our model’s grading matches what doctors would say. OpenAI looked at how often the model agreed with physicians, and also how often physicians agreed with each other. The result? The model and the doctors were just about as likely to agree with each other as the doctors were with one another.

HealthBench is definitely a new leap forward for the medical industry. We believe this is just the beginning of real synergy between AI and healthcare. What’s your take on this? Would you be open to visiting an AI doctor, or are you still feeling skeptical about it?

Gemini Goes Everywhere: Google’s AI Invades Watches, TVs, Cars, and Beyond

Google just announced it’s bringing its Gemini AI assistant to a whole bunch of new places, like Android smartwatches, TVs, cars, and even its upcoming XR headset. Soon, you’ll be able to talk to Gemini on your Wear OS watch, get content recommendations and quick answers on Google TV, and let the AI handle stuff like finding destinations or reading messages in your car with Android Auto. Plus, the new XR headset will launch with Gemini built right in for some pretty cool, immersive experiences.

Google is finally rolling out advanced AI everywhere you use Android, making Gemini the smart layer tying all your devices together—something Apple hasn’t quite caught up with yet.

News of the week 🌍

Trump Unleashes Saudi AI Boom: $600B in US Tech Deals After Chip Ban Lifted

Saudi Arabia is making a major move to become an AI superpower after Trump lifted restrictions on advanced US chip exports during his Gulf tour. Following the launch of the kingdom’s new AI startup Humaine, NVIDIA, AMD, and AWS quickly signed $600 billion in deals, including a huge shipment of top NVIDIA chips. AMD and Amazon are also investing billions to help Saudi build up its AI infrastructure. While some in the US are worried China could gain access to the tech, Trump claims the partnerships will benefit both countries, with a $20 billion Saudi investment flowing into US AI data centers.

Google Unveils AlphaEvolve: AI Agent Cracks Historic Math Problems and Boosts Efficiency

Google just rolled out AlphaEvolve, a next-gen coding agent that uses Gemini and evolutionary strategies to create algorithms for tough scientific and math problems, already making Google more efficient and even cracking decades-old math challenges. AlphaEvolve combines different Gemini models to brainstorm and refine code, and it’s already found improvements to famous algorithms like Strassen’s from 1969, boosted Google’s data center and AI training efficiency, and helped with chip design. When tested on over 50 open math problems, it matched top solutions three-quarters of the time and discovered brand new, better solutions for another 20%. With AlphaEvolve, Google is pushing AI from theory to real, groundbreaking scientific discovery.

OpenAI Launches GPT-4.1 in ChatGPT, Boosts Coding and Transparency

OpenAI is rolling out its new GPT-4.1 and GPT-4.1 mini models in ChatGPT, promising faster and even better coding help for users. GPT-4.1 is now available to ChatGPT Plus, Pro, and Team subscribers, while GPT-4.1 mini is free for everyone. As part of the update, GPT-4.0 mini is being removed. The company faced some criticism for not releasing a full safety report at launch, but says GPT-4.1 isn’t a “frontier model,” so it didn’t need one. OpenAI now plans to share more about its model safety testing through a new public Safety Evaluations Hub. The move comes as AI-powered coding tools are heating up, with OpenAI reportedly close to acquiring Windsurf and Google upgrading its Gemini chatbot to better connect with GitHub.

TikTok Debuts AI Alive: Turn Photos into Animated Videos in Stories

TikTok just launched “AI Alive,” a new tool that lets users turn their photos into animated videos right inside TikTok Stories. Using the Story Camera, you can pick any photo and let the AI add movement, effects, and even sound, like making clouds drift or waves crash in a beach pic. All AI-generated videos will be labeled and include metadata to show they’re made with AI, even if shared outside TikTok. The company says safety is a top priority, with built-in checks to make sure new creations follow the rules.

Google’s Gemini Now Connects Directly to GitHub for Advanced Coding Help

Google’s Gemini chatbot just got a big upgrade for subscribers—now, users on the Gemini Advanced plan can link Gemini directly to any public or private GitHub repo. This makes it easier to generate, explain, and debug code right from your projects. Just hit the “+” in Gemini, import your GitHub URL, and start collaborating with AI on your codebase. While Gemini is now more powerful for coding, Google reminds users that AI code tools can still introduce errors, so always double-check the output

Introducing Codex

OpenAI just released a cloud-based software engineering agent powered by Codex-1. It can handle multiple coding tasks at once and is available now for ChatGPT Pro, Team, and Enterprise users; Plus users getting access soon 😢.

Useful tools ⚒️

mrge - Cursor for code review

Daytona - Secure and elastic infra for running your AI-generated code.

Airpost -Turn your product link into 30+ video ads

Salespeak Website AI Grader - Test your website’s AI compatibility today

Dash - Prompt to run your entire workflow — no tool-switching.

Dash is your AI agent for modern work. It connects to your apps, remembers your context, and gets things done — so you stop wasting time switching, searching, and starting over. Full control, one conversation.

Share this post with friends, especially those interested in AI stories!