OpenAI has finally unveiled GPT-4o! The new model offers real-time multimodal capabilities across audio, vision, and text, with significant enhancements over its predecessors. It's free to use, a strategy similar to the one behind GPT-3.5: attract new users and gather more data to further scale model training.
According to Mira Murati, one notable feature of GPT-4o is its speed: it is up to twice as fast as its predecessor, GPT-4, with API costs reduced by up to 50%. These improvements will let developers continue deploying large-scale AI projects while benefiting from the new pricing.
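For developers, taking advantage of this is mostly a matter of switching the model name. Here's a minimal sketch using OpenAI's official Python SDK (it assumes `OPENAI_API_KEY` is set in your environment; the prompt is just an illustration):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# GPT-4o is a drop-in replacement for earlier chat models:
# same endpoint, new model name, lower price, faster responses.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "In one sentence, what is new in GPT-4o?"},
    ],
)
print(response.choices[0].message.content)
```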
But enough about the technicalities. Let’s see what this new model can do for us!
Vision in real-time
This extends beyond the basic functionality of uploading an image and interacting with it. OpenAI now lets us interact with ChatGPT through the voice assistant, and we can even share content from our computers or smartphones. Responses are generated in real time, allowing analyses across a wide range of content types and levels of complexity.
In the demo below, ChatGPT becomes a math tutor (it blew my mind!).
We can see a fraction of ChatGPT's full capabilities here. It not only solves a math problem but also guides us toward the solution, offering clear guidelines and recommendations that help us understand the entire process in a more educational and illustrative manner.
It’s amazing how the voice and vision capabilities smoothly recognize and interpret questions.
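The live voice-and-vision experience is exclusive to the ChatGPT apps for now, but the underlying vision capability is also exposed through the API. A minimal sketch, assuming the `openai` Python package and a placeholder image URL:

```python
from openai import OpenAI

client = OpenAI()

# A single user message can mix text and image parts; the URL below
# is a placeholder for any publicly accessible image.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Walk me through solving the equation on this whiteboard, step by step."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/whiteboard.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

Asking for step-by-step guidance rather than the final answer is what turns the model into the tutor from the demo.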
Real-time conversational speech
OpenAI has meticulously focused on capabilities such as fluency, tone, and logical sequencing, allowing the model to carry on conversations in a natural way.
During the GPT-4o presentation, the model engaged in smooth conversations and even offered recommendations in a friendly tone, just as a real assistant would. It can produce voices in a range of emotive styles, from dramatic to serious and formal.
Here’s a demo that blends real-time conversational capabilities with audio translation.
The prompting here is more demanding than it seems: ChatGPT needs to smoothly interpret a bilingual conversation in English and Spanish, recognizing both languages and generating responses in the appropriate one.
I’m impressed with its response accuracy and fluency, as it effortlessly meets the set goals. It also manages to avoid the awkward pauses commonly seen in other AI systems providing real-time responses.
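The end-to-end speech pipeline from the demo isn't directly reproducible through the public API yet, but we can approximate the translation loop by chaining Whisper transcription with a GPT-4o call. A rough sketch, assuming a local audio file named `conversation.mp3`:

```python
from openai import OpenAI

client = OpenAI()

# Step 1: transcribe the spoken audio with Whisper.
with open("conversation.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Step 2: have GPT-4o act as the bilingual interpreter from the demo,
# detecting the language and translating in the opposite direction.
translation = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": (
            "You are a live interpreter between English and Spanish. "
            "If the input is English, translate it to Spanish; "
            "if it is Spanish, translate it to English."
        )},
        {"role": "user", "content": transcript.text},
    ],
)
print(translation.choices[0].message.content)
```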
Are the demos as “fake” as Google’s?
In multiple demos, OpenAI tries to show that the videos are not merely a clever edit but are happening in real time. A good example is the video below, where they show how the multimodal capabilities interact to provide precise responses based on what can be seen and heard.
Here are some points I want to emphasize:
It's remarkable how ChatGPT accurately identifies and describes detailed elements. Even as the scene grew more complex with people entering the frame, ChatGPT successfully recognized them.
It's amazing that the new model can create a song that fits specific conditions, generating melodies effortlessly!
The interaction between the two GPT models seemed almost like a glimpse into the future. While not explicitly stated, this seems to be the direction OpenAI is heading. With the capabilities shown by GPT-4o, the next step is for AI systems to interact among themselves. This could lead to one AI training another and other developments that we couldn’t imagine even in our wildest dreams.
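That two-model interaction is surprisingly easy to approximate in code today, albeit with text instead of live audio and video. A toy sketch, assuming the `openai` package, in which two GPT-4o "agents" with hypothetical personas exchange a few turns:

```python
from openai import OpenAI

client = OpenAI()

def reply(persona: str, message: str) -> str:
    """One conversational turn from an agent defined by its system prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": message},
        ],
    )
    return response.choices[0].message.content

# Hypothetical personas echoing the demo: one agent "sees", the other asks.
describer = "You can see a room. Briefly describe what you observe."
asker = "You cannot see. Ask one short question about what the other AI sees."

message = "Hello! What can you see right now?"
for _ in range(3):  # three exchanges between the two agents
    seen = reply(describer, message)
    print("Describer:", seen)
    message = reply(asker, seen)
    print("Asker:", message)
```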
GPT-4o has surpassed other AI models
Text evaluation
The image shared by OpenAI clearly shows that GPT-4o outperforms other models, particularly in areas such as math and HumanEval, a benchmark measuring code-generation ability.
Moreover, GPT-4o has broadened its capabilities beyond English to more than 20 additional languages. A new tokenizer represents text in these languages with far fewer tokens, an enhancement designed to reach a wider global audience.
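The tokenizer change is easy to verify with OpenAI's open-source `tiktoken` library. A quick sketch comparing the GPT-4 encoding (`cl100k_base`) with GPT-4o's new one (`o200k_base`) on a non-English sentence:

```python
import tiktoken

# GPT-4 uses the cl100k_base encoding; GPT-4o ships with o200k_base.
gpt4_enc = tiktoken.get_encoding("cl100k_base")
gpt4o_enc = tiktoken.get_encoding("o200k_base")

sample = "नमस्ते, आप कैसे हैं?"  # Hindi, one of the languages OpenAI highlighted

print("GPT-4 tokens: ", len(gpt4_enc.encode(sample)))
print("GPT-4o tokens:", len(gpt4o_enc.encode(sample)))
# Fewer tokens per sentence means lower cost and more usable context
# for non-English text.
```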
Audio translation performance
GPT-4o's enhanced audio capabilities, together with the text improvements above, provide an opportunity to connect with more people, since language often acts as a barrier not just to communication but to connection itself.
The chart clearly illustrates that GPT-4o has outperformed other AI systems, such as Gemini and Whisper-v3.
More than a small update
For me, this goes beyond merely being a new update to ChatGPT; it is a significant step toward connecting AI with its environment and maximizing its potential. It is also exactly what I expected from OpenAI: delivering a product that focuses on the user through tangible, authentic actions from the start. Multimodality plays a crucial role here, and OpenAI clearly knew it, hence the effort to enhance it so the model can give more precise responses in a variety of real-world contexts.
Now we have a product that feels less "artificial" and meets some of our demands. GPT-4o is one of the first steps toward what GPT-5 will be, and it demonstrates OpenAI's intent to encourage users to deploy this AI in new contexts.