Forget Text, You Can Now Get Audio and Video Responses From Your AI

2023 seems to be the year when artificial intelligence steps out of the shadow of text-only output. Meet NExT-GPT, a model developed by researchers at the National University of Singapore and Tsinghua University. It is not just another AI; it goes beyond textual responses to produce image, audio, and video outputs as well.

Versatile Conversations

Where conversational AI has become synonymous with text outputs, NExT-GPT breathes fresh air into the scene. “It’s an ‘any-to-any’ system,” as its developers put it. This means that users can input a text prompt and receive a video response, or vice versa. It defies the norm, presenting a holistic approach to human-machine interaction. ChatGPT might have earned its stripes in the realm of textual conversation, but NExT-GPT is set to redefine those parameters.

Performance Review

While it’s still in its developmental stages, early interactions with NExT-GPT are illuminating. A trial on its demo site showcases its ability to transform a picture of a cat into an image of the cat assuming the role of a librarian. Although the quality might not match established image generators, the creativity and innovation are evident.

Moving Beyond the Textual Format

“The era of pure text is over,” says a researcher from the National University of Singapore. NExT-GPT is pegged as an open-source model, a creation that doesn’t just break the mould but reshapes it. It’s tailored to accommodate and process a combination of text, images, audio, and video, embodying the evolution of AI into a multimodal entity.

NExT-GPT employs a technique termed “modality-switching instruction tuning” to enhance its cross-modal reasoning abilities. Each type of input, be it text, image, audio, or video, is converted into embeddings that the core language model comprehends. It’s a dance of technology where artificial intelligence meets multimedia, producing a symphony of interactive outputs.
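The idea of converting each input type into embeddings the language model can read can be illustrated with a toy sketch. This is not NExT-GPT's actual code: the dimensions, the random "projection" matrices, and the function names (`make_projection`, `project`, `build_llm_input`) are all hypothetical stand-ins, assuming only that each modality has its own encoder and that a projection step maps its features into one shared embedding width.

```python
import random

LLM_DIM = 8  # hypothetical width of the language model's embedding space

# Hypothetical encoder output widths for each modality (assumed values).
ENCODER_DIMS = {"text": 8, "image": 12, "audio": 6, "video": 16}


def make_projection(in_dim, out_dim, rng):
    """Return a random out_dim x in_dim matrix standing in for a learned projection."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, vector):
    """Multiply matrix by vector, mapping one encoder feature into LLM space."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def build_llm_input(segments, projections):
    """Turn (modality, features) pairs into one sequence of LLM-sized embeddings."""
    sequence = []
    for modality, features in segments:
        for vector in features:
            sequence.append(project(projections[modality], vector))
    return sequence


rng = random.Random(0)
projections = {m: make_projection(d, LLM_DIM, rng) for m, d in ENCODER_DIMS.items()}

# Fake encoder outputs: 3 text tokens followed by 2 image patches.
segments = [
    ("text", [[rng.random() for _ in range(ENCODER_DIMS["text"])] for _ in range(3)]),
    ("image", [[rng.random() for _ in range(ENCODER_DIMS["image"])] for _ in range(2)]),
]
sequence = build_llm_input(segments, projections)
print(len(sequence), len(sequence[0]))  # 5 embeddings, each of width LLM_DIM
```

Whatever the source modality, every item in `sequence` ends up the same width, which is what lets a single language model treat text tokens and image patches as one stream. In a real system the projections would be trained rather than random.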

Real-World Resonance

In practical terms, NExT-GPT isn’t just a technological spectacle; it’s a tool with tangible applications. It finds its place in the creation of intelligent virtual assistants and the nuanced field of video analysis. It’s not just about responding to queries but about understanding and interpreting multimodal inputs in a context that’s both meaningful and relevant.

The Open Source Advantage

Being open source, NExT-GPT isn’t confined to the original blueprint. It’s a canvas where AI enthusiasts and developers can paint their innovations. There’s a democracy in its design, a freedom that allows for the modification and enhancement of the model to suit diverse needs.

Though NExT-GPT’s offerings are impressive, they are not without flaws. Early testers reported less-than-perfect results with the video and audio features, marking an area for improvement. The distortions in the generated video and audio reflect how young this technology is and point to where future refinements are needed.

Market Dynamics

AI headliners like OpenAI and Google offer capable tools and have bigger developments in the pipeline, but NExT-GPT carves its niche as an open-source alternative. Multimodality is no longer just a distant dream but a present reality. The intermittent availability of the demo site, however, underlines that this ambitious project is still under development.

The Developer’s Point of View

Joeli Brearley, founder of Pregnant Then Screwed, spoke about costs. “The costs for those parents will drastically increase,” she said, pointing to the economic and social dimensions that AI innovations like NExT-GPT are set to navigate.

NExT-GPT isn’t the endpoint but rather a milestone in the continuous evolution of artificial intelligence. As it stands, it’s a testament to the strides made in AI, a narrative of progress that goes beyond text and steps into the rich, multifaceted realm of multimodal interaction. It’s not just about what AI can do today but about the vistas of possibilities it unlocks for tomorrow.