How much data do I need to train a model?

The data requirements vary depending on the model you choose:

DDSP: Approximately 10-15 minutes of clean, monophonic recordings, ideally with a MIDI transcription (or material that is easy to transcribe). Best for single-instrument timbres.
RAVE: 2-3 hours of clean, high-quality, stylistically coherent audio (a single style). Works with diverse sound types and can handle more complex timbres.
AFTER: Typically more than 1 hour of audio for good results, ideally with a MIDI transcription (or material that is easy to transcribe). Supports polyphonic content.
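
To check a dataset against these figures before training, you can total the duration of your audio files. The following is a minimal sketch, not part of any of the three models' tooling: the function names and the MIN_MINUTES table are hypothetical, the thresholds are simply the lower bounds quoted above, and it assumes your recordings are uncompressed PCM WAV files (the only format Python's built-in wave module reads).

```python
import wave
from pathlib import Path

# Lower bounds in minutes, copied from the guidance above (assumption:
# the low end of each range is treated as the minimum). AFTER is stated
# as "more than 1 hour", so 60 is only a rough floor.
MIN_MINUTES = {"DDSP": 10, "RAVE": 120, "AFTER": 60}


def wav_minutes(path):
    """Return the duration of one PCM WAV file in minutes."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate() / 60


def dataset_minutes(folder):
    """Sum the durations of all .wav files under a folder (recursive)."""
    return sum(wav_minutes(p) for p in Path(folder).rglob("*.wav"))


def models_with_enough_data(total_minutes):
    """List the models whose lower bound is met by total_minutes."""
    return [name for name, minimum in MIN_MINUTES.items()
            if total_minutes >= minimum]
```

For example, models_with_enough_data(dataset_minutes("my_recordings")) lists the models your folder is long enough for, where "my_recordings" is a placeholder path. Duration is only one criterion: the check says nothing about whether the audio is clean, monophonic, or stylistically coherent.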

Make sure you own the rights to use the audio as training data. You agree that you alone are responsible for copyright compliance should any harm or dispute arise.

For detailed best practices, check the documentation for each model: RAVE, AFTER, DDSP.