Audio & Music Generation

Audio generation is the production of sound, speech, or music with AI models. These models can compose music, mimic voices, or synthesize sound effects, with applications in music composition, voice synthesis, and audio production.



Courses

  • Hugging Face Audio course: equips learners to tackle a range of audio tasks, such as speech recognition, audio classification, and text-to-speech, using transformers. It covers the particulars of audio data, surveys transformer architectures for audio, and shows how to train your own audio transformers by leveraging powerful pre-trained models; a minimal speech-recognition sketch follows below.
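
As a taste of the course's material, here is a minimal sketch of transformer-based speech recognition using the Hugging Face pipeline API; the Whisper checkpoint and audio filename are illustrative placeholders.

```python
# Minimal speech-recognition sketch with the Hugging Face `transformers`
# pipeline. The checkpoint and input file are illustrative placeholders.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Transcribe a local audio file; the pipeline handles decoding and resampling.
result = asr("sample.wav")
print(result["text"])
```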

Advancements

  • MuseNet (2019) is an OpenAI deep neural network that generates 4-minute musical compositions with up to 10 instruments, blending styles from country to Mozart to the Beatles across a wide range of genres.

  • Jukebox (2020), an OpenAI generative music model, produces music conditioned on genre, artist, and lyrics. It advances musical quality, though a gap with human-created music remains (paper). You can listen to samples on Jukebox Music, and the source code has been released.

  • MusicLM: Generating Music From Text (2023) by Google Research generates high-fidelity music from text descriptions and can transform hummed melodies to match a text caption. Google also released the MusicCaps dataset alongside it. MusicLM builds on the AudioLM framework described in the next entry. (paper)

  • AudioLM: a Language Modeling Approach to Audio Generation (2023): AudioLM, from Google, casts audio generation as a language modeling task over discrete audio tokens. Given a short audio prompt, it generates natural, coherent continuations of speech or piano music, preserving speaker identity and prosody without requiring transcripts or other annotations. (paper) (blog post)

  • AudioCraft (2023), Meta's open-source audio library, bundles three models: MusicGen for text-to-music, AudioGen for text-to-audio such as sound effects, and EnCodec, the neural audio codec the generators decode through. It simplifies working with generative audio models and enables faster iteration in early prototyping; a minimal MusicGen sketch follows this list.

  • Lyria (2023) by Google DeepMind, developed in collaboration with YouTube, is an advanced music generation model accompanied by two AI experiments: "Dream Track", which strengthens connections between artists and fans on YouTube Shorts, and "Music AI tools", built with creators to enrich their artistic process.

  • Stable Audio, Stability AI's first audio product, lets users generate original music and sound effects from a text prompt and a target duration. The high-quality 44.1 kHz stereo output is produced by a latent audio diffusion model trained on data from the AudioSparx music library. (website)

  • Qwen-Audio by Alibaba Cloud is a robust audio chatbot built on a pretrained large audio-language model. It is designed to understand audio and respond to natural-language queries, with code and pre-trained weights openly available for quick adoption; a chat sketch follows this list. Qwen-Audio joins Qwen-VL (vision-language) and Qwen (language) in Alibaba Cloud's model family.
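
To make the AudioCraft entry concrete, here is a minimal text-to-music sketch following the audiocraft library's documented usage; the model size, prompt, and output filename are illustrative choices.

```python
# Minimal MusicGen sketch with Meta's audiocraft library (pip install audiocraft).
# Model size, prompt, and output name are illustrative choices.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)  # seconds of audio per sample

# One waveform is generated per text description in the batch.
wavs = model.generate(["lo-fi hip hop beat with warm piano"])

for i, wav in enumerate(wavs):
    # audio_write appends .wav and applies loudness normalization.
    audio_write(f"musicgen_{i}", wav.cpu(), model.sample_rate, strategy="loudness")
```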
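
And here is a sketch of chatting with Qwen-Audio, following the usage pattern shown in its model card; the audio path and question are placeholders, so check the repository for the exact interface.

```python
# Sketch of Qwen-Audio-Chat, following the pattern in its model card.
# Requires trust_remote_code; the audio path and question are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-Audio-Chat", device_map="auto", trust_remote_code=True
).eval()

# Interleave an audio clip with a text question, as in Qwen-VL.
query = tokenizer.from_list_format([
    {"audio": "speech.wav"},
    {"text": "What is the speaker saying?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```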

Reference

  • Chirp v1, developed by Suno, introduces notable improvements: higher audio quality, selectable genres, support for over 50 languages, 25% faster generation than Chirp v0, and control over song structure through metatags such as [Verse] and [Chorus]. These updates mark the next generation of text-to-music AI.

  • Bark, a transformer-based text-to-audio model by Suno, excels at multilingual speech synthesis and can also generate music, ambient sound, and nonverbal expressions such as laughter, sighs, and crying. The Bark Speaker Library (v2) offers sample prompts for voices across languages and genders; a minimal usage sketch follows this list. (examples) (code) (live model)

  • Camenduru's Audio ML Papers: a GitHub collection of repositories spanning audio generation, music captioning, voice conversion, text-to-speech, and more, offering a rich starting point for exploring the field.
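
As a pointer for getting started with Bark, here is a minimal sketch based on its README; the speaker preset and prompt text are illustrative.

```python
# Minimal Bark sketch based on its README
# (pip install git+https://github.com/suno-ai/bark).
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # download and cache the model weights

# Bracketed cues such as [laughs] render as nonverbal sounds.
text = "Hello, my name is Suno. [laughs] And I like to sing."
audio = generate_audio(text, history_prompt="v2/en_speaker_6")

write_wav("bark_out.wav", SAMPLE_RATE, audio)
```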