.ds File Format and CLI Inference Scripts Documentation #198

Open
PeterFavero opened this issue Jun 24, 2024 · 2 comments

PeterFavero commented Jun 24, 2024

Hello,

I'm interested in running command-line inference using the .ckpt files of the model I trained. After reading the instructions under Inference in docs/GettingStarted.md and the --help output of the relevant inference scripts (specifically python scripts/infer.py variance --help and python scripts/infer.py acoustic --help), I still don't fully understand how .ds files work or, less importantly, the details of some of the infer.py parameters. I largely understand what each parameter controls, but I'd like to know how to configure --num, --key, --expr, and --step based on a more precise understanding of what they actually do, along with general best practices for them. There is no thorough documentation here on either topic. The .ds docs may be out of scope for this repo (I looked briefly at the original OpenUtau repo and the recommended fork for OpenUtau with DiffSinger, but didn't find anything), but do you know where I could find docs on both topics to reference for my project?

Thank you,
Peter

@hrukalive

.ds files are just JSON in disguise, so you can open them with any text editor. The structure inside is intuitive, so I won't explain it here, but please follow up if you have further questions. To do inference using the CLI, you will most likely perform variance inference and then acoustic inference.
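Since a .ds file is plain JSON, the standard `json` module is enough to inspect one. The following is a minimal sketch, not the repository's own loader. It assumes a file holds either a single sentence object or a list of them, and the two field names in the example input are hypothetical placeholders used only to exercise the function.

```python
import json


def summarize_ds(text):
    """Return one summary dict per sentence found in a .ds document."""
    data = json.loads(text)
    # Assumption: a .ds file holds either one sentence object or a list of them.
    sentences = data if isinstance(data, list) else [data]
    return [
        {"index": i, "fields": sorted(sentence.keys())}
        for i, sentence in enumerate(sentences)
    ]


# Hypothetical two-field sentence, only to exercise the function.
example_text = json.dumps([{"offset": 0.0, "ph_seq": "r eh d"}])
summary = summarize_ds(example_text)
print(summary)
```

Running the same function on the file before and after variance inference is one way to see which fields a given checkpoint adds.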

Variance inference adds new fields to each "sentence" in the .ds file, such as breath, voicing, or any other feature enabled in your checkpoint. You have to use the variance checkpoint to infer every parameter the acoustic model requires. The output of this inference step is a new .ds file.

Then feed the .ds file from the previous step into acoustic inference to get your .wav file out. Arguments like --key are transpositions applied to the .ds file globally, so they have nothing to do with inference quality. --step does control quality, but the recommended number of steps differs depending on whether you used a Rectified Flow model.
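To make "a transposition applied globally" concrete, here is a small sketch of shifting an f0 curve by a number of semitones. It assumes the transposition is measured in equal-tempered semitones and that a value of 0 marks an unvoiced frame. The function name is hypothetical, and the actual infer.py code may handle notes and pitch differently.

```python
def transpose_f0(f0_hz, semitones):
    """Shift every voiced f0 value by a number of semitones (equal temperament)."""
    ratio = 2.0 ** (semitones / 12.0)
    # Zero is kept as-is on the assumption that it marks an unvoiced frame.
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


# One octave up (12 semitones) doubles every voiced frequency.
shifted = transpose_f0([440.0, 0.0, 220.0], 12)
print(shifted)
```

Because the shift only rescales pitch, it changes what is sung rather than how well it is rendered, which matches the point that such arguments are unrelated to inference quality.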


PeterFavero commented Jun 25, 2024

Hi there,

Thank you so much for the reply!

I really appreciate the insight. It's my bad for not mentioning this earlier, but it would probably help if I elaborated on my use case. Given some music and phonemic data, not necessarily in .ds format, about sung audio over some time interval (e.g., MIDI or an f0 spectrum plus the set of phonemes I want the model to sing, with start and end times for vowel phonemes already set in stone), I want to generate audio of my DiffSinger singing it using only code and/or CLI commands. My DiffSinger is composed of a duration-only variance model (predict_dur: true, predict_pitch: false), a pitch-only variance model (predict_dur: false, predict_pitch: true), and an acoustic model with energy, breathiness, tension, and voicing all disabled for now. Some examples in increasing order of complexity include:

  • Instruct my DiffSinger to sing the word "red" [r eh d] from t = 0 -> 0.5 sec, with 'eh' held from t = 0.1 -> 0.4 sec, at a uniform f0 of 440 Hz (A4).
  • Instruct my DiffSinger to sing the word "strength" [s t r eh ng th] from t = 0 -> 0.5 sec, with 'eh' held from t = 0.15 -> 0.35 sec, at a uniform f0 of 440 Hz (A4).
  • Same as above, but with a changing f0-over-time-sequence.
  • Instruct my DiffSinger to sing the phrase "And I will always love you" [ax n d ay w ih l ao l w ey z l ah v y uw] or "I am titanium" [ay ae m t ay t ey n iy ax m]: *
    • Over a time interval corresponding to the length of that line in the original song (which I have complete waveform data for).
    • With all vowels lining up to their intervals in the original song.
    • According to an f0-over-time-sequence generated from the song (that I could compute myself using PyWorld or, more likely, export from OpenUtau).
    • With appropriate slurs between notes and accounting for the possibilities of notes changing within long-held phonemes, such as the "I" in the first example.
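As a rough illustration of the first example ("red" at a constant 440 Hz), the sketch below turns phoneme boundaries into a sentence-like dictionary. The field names (`offset`, `ph_seq`, `ph_dur`, `f0_seq`, `f0_timestep`), the space-separated string encoding, and the 5 ms f0 timestep are assumptions modeled on exported files, not a documented spec. A real file will probably need more fields than this.

```python
def build_sentence(phonemes, boundaries, f0_hz, f0_timestep=0.005):
    """Build a .ds-style sentence dict from phoneme boundaries and a constant f0."""
    if len(boundaries) != len(phonemes) + 1:
        raise ValueError("need exactly one more boundary than phonemes")
    # Each duration is the gap between consecutive boundaries, rounded to
    # hide floating-point noise such as 0.30000000000000004.
    durations = [
        round(end - start, 6) for start, end in zip(boundaries, boundaries[1:])
    ]
    if any(d <= 0 for d in durations):
        raise ValueError("boundaries must be strictly increasing")
    total = boundaries[-1] - boundaries[0]
    # One f0 sample per timestep, including both endpoints of the interval.
    n_frames = int(round(total / f0_timestep)) + 1
    return {
        "offset": boundaries[0],
        "ph_seq": " ".join(phonemes),
        "ph_dur": " ".join(str(d) for d in durations),
        "f0_seq": " ".join(f"{f0_hz:.1f}" for _ in range(n_frames)),
        "f0_timestep": str(f0_timestep),
    }


# "red": r from 0.0 to 0.1 s, eh from 0.1 to 0.4 s, d from 0.4 to 0.5 s.
red = build_sentence(["r", "eh", "d"], [0.0, 0.1, 0.4, 0.5], 440.0)
print(red["ph_seq"], "|", red["ph_dur"])
```

A changing f0 curve would replace the constant list with one value per timestep, for instance values exported from OpenUtau or computed with PyWorld.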

I'd already exported .ds data for several .wavs in my training dataset from OpenUtau and examined them in VS Code. While I could understand what each field meant without any issues, as you said, I was asking for more granular details. Accomplishing my use case with CLI commands would likely mean generating a .ds file from scratch, or editing one procedurally after exporting it from OpenUtau, then generating specific fields with my two variance models (I'm not sure in which order to apply them) before running the .ds file through the acoustic model, possibly with some intermediate editing as well. Because I didn't know exactly how I'd implement this process, which seemed fairly complicated and error-prone, I wanted a more thorough spec ** for how the .ds files and inference scripts work: early on, even slight changes to the .ds files I got from OpenUtau produced a lot of difficult-to-understand errors. This is also my first time training a DiffSinger, so that didn't help either.
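When editing .ds files procedurally, a small consistency check run before inference can catch mistakes earlier than the inference script does. The sketch below is only an illustration: it assumes `ph_seq` and `ph_dur` are space-separated strings that must have the same number of entries, which is inferred from exported files rather than taken from a documented spec.

```python
def check_sentence(sentence):
    """Return a list of human-readable problems found in one sentence dict."""
    problems = []
    phonemes = sentence.get("ph_seq", "").split()
    if not phonemes:
        problems.append("ph_seq is missing or empty")
    # ph_dur may legitimately be absent before duration inference has run.
    if "ph_dur" in sentence:
        try:
            durations = [float(x) for x in sentence["ph_dur"].split()]
        except ValueError:
            problems.append("ph_dur contains a non-numeric value")
        else:
            if len(durations) != len(phonemes):
                problems.append(
                    f"ph_dur has {len(durations)} entries "
                    f"but ph_seq has {len(phonemes)}"
                )
            if any(d <= 0 for d in durations):
                problems.append("ph_dur contains a non-positive duration")
    return problems


good = check_sentence({"ph_seq": "r eh d", "ph_dur": "0.1 0.3 0.1"})
bad = check_sentence({"ph_seq": "r eh d", "ph_dur": "0.1 0.3"})
print(good, bad)
```

The same pattern extends to any other per-phoneme or per-note field once its expected length is known.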

However, I then looked at the scripts in this repo more closely and noticed that deployment/benchmarks/acoustic.py does something relatively similar to my use case using an ONNX runtime, which I'm a bit more familiar with than the infer.py script and the .ds file type. Additionally, I'm thinking of creating a simple web app for my musician friends and me to use with this model. That would involve deploying it to the cloud, which ONNX is also much better suited for in terms of efficiency and dependency configuration. Apologies for the long and semi-tangential thread, but is it possible to use ONNX runtimes of my three models to accomplish these goals? I understand I'm reimplementing in Python a lot of what OpenUtau does under the hood, but I wanted your expertise on the DiffSinger-related portions of my task as I keep working on it.
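One piece of that reimplementation is likely to be converting phoneme durations from seconds into frame counts. The sketch below assumes the exported model wants integer frame counts per phoneme, and the sample rate and hop size are placeholder values that should be read from the model's own configuration. It rounds the cumulative boundary rather than each duration, so the total length does not drift.

```python
def durations_to_frames(durations_sec, sample_rate=44100, hop_size=512):
    """Convert per-phoneme durations in seconds to integer frame counts."""
    frames_per_second = sample_rate / hop_size
    frames = []
    elapsed = 0.0
    previous_edge = 0
    for duration in durations_sec:
        elapsed += duration
        # Round the running boundary, not the single duration, so that
        # rounding errors do not accumulate over a long phrase.
        edge = int(round(elapsed * frames_per_second))
        frames.append(edge - previous_edge)
        previous_edge = edge
    return frames


# "red" with r = 0.1 s, eh = 0.3 s, d = 0.1 s.
frames = durations_to_frames([0.1, 0.3, 0.1])
print(frames, sum(frames))
```

With this approach the frame counts always sum to the rounded total length, which keeps an f0 sequence sampled at the same frame rate aligned with the phonemes.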

* I wouldn't use any copyrighted material; I'm just using these songs for the sake of example.
** Absolutely no pressure, as I'm more interested in using ONNX at this point, but my previous questions included: Is there a comprehensive list of possible .ds fields somewhere, similar to ConfigurationSchemas.md but for .ds files? Which fields are required, optional, or forbidden for variance and acoustic models under different configs? Which fields does inference with variance and acoustic models add or change, depending on configs and infer.py arguments? Where and how should I get or construct the data for the original my_song.ds file for different use cases?
