A Revolutionary Tool for Audio Creation: Fugatto
In an exciting development for artificial intelligence, a team of NVIDIA researchers specializing in generative AI has unveiled a groundbreaking tool for audio manipulation and creation, aptly described as a "Swiss Army knife" for sound. The tool lets users manipulate and generate audio using nothing more than text prompts.
Unlike many existing AI models, which can compose a piece of music or modify a voice but not both, this new offering stands out for its versatility. Called Fugatto, short for Foundational Generative Audio Transformer Opus 1, the model can generate or transform any combination of music, voices, and sounds from descriptive prompts supplied as text, audio files, or both.
Fugatto opens up a world of possibilities for audio creation. For instance, it can produce a new music snippet based on a simple text prompt, add or remove instruments from an existing song, change the accent or emotion in a voice, and even create entirely new sounds that have never been heard before.
Ido Zmishlany, a well-known multi-platinum producer and songwriter, and co-founder of One Take Audio, a startup involved in the NVIDIA Inception program, expressed his excitement about Fugatto. He stated, "Sound is my inspiration. It’s what drives me to create music. The idea that I can create entirely new sounds instantly in the studio is incredible."
Understanding Audio Like Humans
Rafael Valle, a manager of applied audio research at NVIDIA and one of the creators of Fugatto, explained the motivation behind the creation of this model. "We wanted to create a model that understands and generates sound like humans do," he said.
Fugatto supports a wide variety of audio generation and transformation tasks. It is the first foundational generative AI model to demonstrate emergent properties, capabilities that arise from the interaction of its various trained abilities, as well as the ability to combine free-form instructions.
"Fugatto represents our first step towards a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale," Valle added.
Diverse Applications and Use Cases
Fugatto is not just a tool for musicians; it has a broad range of applications across different fields. Music producers can use it to quickly prototype or edit an idea for a song, experimenting with different styles, voices, and instruments. They can also add effects and enhance the overall audio quality of an existing track.
Zmishlany pointed out, "The history of music is also a history of technology. The electric guitar gave the world rock and roll. When the sampler showed up, hip-hop was born. With AI, we’re writing the next chapter of music. We have a new instrument, a new tool for making music — and that’s super exciting."
Advertising agencies can leverage Fugatto to customize campaigns for different regions or scenarios, applying varied accents and emotions to voiceovers. Language learning tools could be made more personal by using any voice a speaker chooses, allowing online courses to be delivered in the voice of a family member or friend.
Video game developers can use Fugatto to modify prerecorded assets to match the changing action as players progress through a game. They can also create new assets on the fly from text instructions and optional audio inputs.
Creating Unique Sounds
One of Fugatto’s most intriguing features is its ability to synthesize novel sounds. Valle described a capability the team calls the "avocado chair," a nod to a novel image famously produced by a text-to-image model. With Fugatto, a trumpet can be made to bark or a saxophone to meow, letting users create whatever sounds they can describe.
Moreover, with some fine-tuning and small amounts of singing data, researchers discovered that Fugatto could undertake tasks it wasn’t specifically trained for, such as generating a high-quality singing voice from a text prompt.
Artistic Control for Users
Fugatto offers several novel capabilities that enhance user experience. During inference, the model employs a technique called ComposableART, which allows it to combine instructions that were individually seen during training. For instance, users can instruct the model to speak text with a sad feeling in a French accent.
This ability to interpolate between instructions gives users fine-grained control, letting them adjust, for example, how heavy the accent is or how strong the emotion sounds.
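NVIDIA has not published the implementation details of ComposableART, but the idea of weighting and blending instructions can be illustrated with a minimal sketch in the spirit of classifier-free guidance. Everything below, including the score_fn stand-in and its signature, is an illustrative assumption rather than Fugatto's actual API:

```python
import numpy as np

def compose_instructions(score_fn, x, instructions, weights):
    """Hedged sketch of combining text instructions at inference time.
    score_fn(x, text) is a hypothetical stand-in for the model's
    conditional output; it is NOT Fugatto's real interface.
    """
    # Unconditional output: the model's result with no instruction.
    s_uncond = score_fn(x, None)
    s = np.copy(s_uncond)
    # Each instruction nudges the output toward its condition; the
    # user-chosen weight decides how much emphasis it receives.
    for text, w in zip(instructions, weights):
        s += w * (score_fn(x, text) - s_uncond)
    return s

# Example: emphasize the sad emotion more than the French accent.
# composed = compose_instructions(score_fn, x,
#     ["speak with a sad emotion", "speak with a French accent"],
#     [1.5, 0.8])
```

In this sketch, a weight above 1.0 exaggerates an attribute while a value between 0 and 1 softens it, which is the kind of subjective emphasis the model's designers describe.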
Rohan Badlani, an AI researcher who designed these elements of the model, stated, "I wanted to let users combine attributes in a subjective or artistic way, deciding how much emphasis to put on each one." Badlani, who holds a master’s degree in computer science with a focus on AI from Stanford, shared, "In my tests, the results were often surprising and made me feel a bit like an artist, even though I’m a computer scientist."
Another unique feature of Fugatto is its ability to generate sounds that evolve over time, a concept referred to as temporal interpolation. For example, it can create the sound of a rainstorm moving through an area, with crescendos of thunder that gradually fade into the distance. Users can control how the soundscape develops over time.
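The article does not explain how temporal interpolation works internally. One simple way to picture it is to slide the conditioning from one instruction's embedding toward another's across the generated frames; the sketch below makes that assumption explicit, and the embedding vectors and frame count are placeholders:

```python
import numpy as np

def interpolate_conditions(embed_start, embed_end, num_frames):
    """Illustrative sketch: one conditioning vector per output frame,
    blending linearly from a starting instruction (e.g. "thunderstorm
    overhead") to an ending one (e.g. "rain fading into the distance").
    This is an assumed mechanism, not Fugatto's documented internals.
    """
    alphas = np.linspace(0.0, 1.0, num_frames)[:, np.newaxis]
    return (1.0 - alphas) * embed_start + alphas * embed_end

# Example with toy 4-dimensional embeddings over 8 frames:
# conds = interpolate_conditions(np.ones(4), np.zeros(4), num_frames=8)
```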
Unlike most models, which can only recreate the training data they have been exposed to, Fugatto lets users create entirely new soundscapes it has never encountered, such as a thunderstorm easing into dawn with the sound of birds singing.
Inside Fugatto’s Technology
Fugatto is built upon the team’s previous work in areas such as speech modeling, audio vocoding, and audio understanding. The full version of Fugatto uses 2.5 billion parameters and was trained on NVIDIA DGX systems equipped with 32 NVIDIA H100 Tensor Core GPUs.
The creation of Fugatto was a collaborative effort involving a diverse team from around the world, including members from India, Brazil, China, Jordan, and South Korea. This global collaboration contributed to Fugatto’s multi-accent and multilingual capabilities.
One of the most challenging aspects of developing Fugatto was generating a blended dataset containing millions of audio samples for training. The team employed a multifaceted approach to generate data and instructions, significantly expanding the range of tasks the model could perform. This approach also improved accuracy and enabled new tasks without requiring additional data.
The team also carried out a meticulous examination of existing datasets, uncovering new relationships among the data. The entire development process spanned over a year.
Valle recalled two pivotal moments during the project. "The first time it generated music from a prompt, it blew our minds," he said. Later, the team demonstrated Fugatto’s ability to create electronic music with dogs barking in time to the beat. "When the group broke up with laughter, it really warmed my heart."
Fugatto represents a significant advancement in AI-driven audio generation and manipulation. Its ability to combine text and audio inputs to create novel sounds and its wide range of applications make it a powerful tool for artists, developers, and other professionals. As AI continues to evolve, tools like Fugatto are poised to redefine the way we create and experience sound.