

Text taken from here. Some information has been changed for various reasons.

Are vocal synths ethical? Yes. How so?

Compensation

Hatsune Miku is made out of recordings by Saki Fujita. Saki Fujita is contracted to record Miku samples, and is paid for her work.

Recording Method

Saki Fujita records from a list of sounds. At least one recording is needed for every sound Miku should be able to sing. She can also sing the recording list a second time in a different octave, so that Miku sounds more natural across her range.
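
As a rough sketch of the idea (the sound list and pitches below are made-up examples in Python, not Crypton's actual recording list), one take is expected per sound, per pitch:

    # Hypothetical excerpt of a recording list ("reclist") and the pitches it
    # gets recorded at; one take is expected per (sound, pitch) pair.
    RECLIST = ["a", "i", "u", "e", "o", "ka", "ki", "ku", "ke", "ko"]
    PITCHES = ["C4", "C5"]  # the same list sung a second time an octave up

    expected_takes = [(sound, pitch) for pitch in PITCHES for sound in RECLIST]

    for sound, pitch in expected_takes[:5]:
        print(f"record '{sound}' at {pitch}")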

Labelling

The samples Saki Fujita sang are then labelled with the sound they make. These sounds are then reproduced by the engine. This is how vocal synth software such as VOCALOID and UTAU works. This model is called "concatenative".
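
As a toy sketch of the concatenative idea (the phoneme labels and file names here are invented, not VOCALOID's or UTAU's real data), the engine is basically doing a lookup-and-splice:

    # Each labelled sample is a short clip; the engine strings clips together
    # in order to produce the line. Labels and file names are invented.
    samples = {
        "ko": "miku_ko.wav",
        "n": "miku_n.wav",
        "ni": "miku_ni.wav",
        "chi": "miku_chi.wav",
        "wa": "miku_wa.wav",
    }

    def concatenate(phonemes):
        """Return the ordered clips the engine would splice together."""
        return [samples[p] for p in phonemes]

    print(concatenate(["ko", "n", "ni", "chi", "wa"]))  # "konnichiwa", roughly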

User interfacing

These voicebanks are very flat. Users must adjust the vocals themselves in order to produce singing. This is referred to as "tuning". If you listen to "Tuning BLANK in the style of Vocaloid producers", you can hear that there are countless ways to tune Hatsune Miku. It is considered a form of artistic expression.
Compare Scratchin' Melodii's original songs to the updated versions. This is the result of hiring an experienced Vocaloid tuner.
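
As a loose sketch of what tuning means in practice (the parameter names below are illustrative, not any specific editor's actual format): the engine hands you a flat note, and the user draws curves on top of it.

    # The engine gives a flat note; the user layers pitch bends, dynamics,
    # vibrato and so on over it. Parameter names are illustrative only.
    flat_note = {"lyric": "la", "pitch": "A4", "length_ticks": 480}

    tuned_note = {
        **flat_note,
        "pitch_bend": [0, -30, -10, 0, 0, 5, 0],  # a small scoop into the note
        "dynamics": [64, 80, 96, 96, 80, 64],     # a swell and decay
        "vibrato_depth": 20,
    }

    print(tuned_note)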


How do AI Vocal Synths work? They are actually extremely similar!

Compensation

Let's use the Synthesizer V Studio library "Solaria" as an example. Solaria is made out of recordings by Emma Rowley. Emma Rowley is contracted to record Solaria samples, and is paid for her work.

Recording

Emma Rowley then records several hours of singing data. This is the substance of the library.

Base model

The AI needs a base to understand what it's interpreting. Unlike image generation, there is a large amount of volunteer voice data out there, so it's typically assumed that base models are trained ethically.

Labelling

Labelling is also the same. The singing is broken up into phonemes the engine will interpret.
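
A minimal sketch of what a time-aligned phoneme label looks like (the times and phonemes below are made up, loosely in the spirit of the .lab files many singing-synthesis toolchains use):

    # (start seconds, end seconds, phoneme) for a short stretch of singing
    labels = [
        (0.00, 0.12, "s"),
        (0.12, 0.45, "o"),
        (0.45, 0.58, "l"),
        (0.58, 0.90, "a"),
    ]

    for start, end, phoneme in labels:
        print(f"{phoneme}: {start:.2f}s to {end:.2f}s")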

Deep Learning

In casual speech, "AI" refers to machine learning and sorting algorithms. "Diffusion" AI is built on a DNN, a deep neural network. This is the most drastic difference between concatenative and AI voicebanks.

Teaching the base model

The computer must be taught what the sounds are. The concept it builds is the "base model".

Training the voice model

Emma Rowley's recordings are then made into a reference point, so that the engine only renders based on what it knows about Emma Rowley's singing.
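
A heavily simplified sketch of this two-stage idea (a stand-in network and random placeholder data, not Dreamtonics' actual pipeline): train a small model on broad singing data first, then keep training it on the target singer's data only.

    # Stage 1 builds the "base model" on broad data; stage 2 continues training
    # ("fine-tunes") on the target singer only, so the output follows her voice.
    # Random tensors stand in for real acoustic features here.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    def train(features, targets, steps):
        for _ in range(steps):
            optimizer.zero_grad()
            loss = loss_fn(model(features), targets)
            loss.backward()
            optimizer.step()

    train(torch.randn(256, 8), torch.randn(256, 8), steps=100)  # base model
    train(torch.randn(64, 8), torch.randn(64, 8), steps=100)    # voice model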

Diffusion

The Solaria model uses everything it learned from Emma Rowley's recordings and the base model to determine how 'a' sounds based on what note it's sung on, what's next to it, and so on.
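
As a toy illustration of the diffusion idea (the "denoiser" below is a hard-coded stand-in, not a trained network or Synthesizer V's real renderer): start from noise and repeatedly nudge it toward what the model believes the conditioned-on sound should be.

    import numpy as np

    def toy_diffusion_render(conditioning, steps=50, frames=100):
        # conditioning would carry the phoneme, note, neighbours, etc.
        rng = np.random.default_rng(0)
        x = rng.standard_normal(frames)              # start from pure noise
        target = np.sin(np.linspace(0, 8, frames))   # stand-in for "what the model learned"
        for t in range(steps):
            # each step moves the noisy signal a bit closer to the prediction
            x = x + (target - x) / (steps - t)
        return x

    rendered = toy_diffusion_render({"phoneme": "a", "note": "G4", "neighbours": ("l", "r")})
    print(rendered[:5])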

Interfacing

Tuners have been mixed on this; it sounds much clearer, yet the AI also comes with its own pitch models, so there's not as much of an incentive to develop your own personal flair.


Are voice changers ethical? Oh geez.

ARE they ethical?

We don't need to break this down a third time. Voice changers are the generative AI of voice synthesis. They require a lot less work from both the developer and the user: a simple application of everything the machine knows onto a piece of audio.

What are the ranges of ethics?

Vocaloid 6 is packaged with a voice changer. It is only for AI libraries, voiced by people who agreed to this and were compensated. This is definitely ethical.
If you bought Hatsune Miku, you're nominally permitted to use the results as you see fit as long as you aren't making a usable port to another piece of software. Is tuning Miku and then creating a voice changer of her singing ethical? No. The only company that has given permission to do this is ST MEDiA with SeeU.
There's also a question of art. If you were to project the voice actor onto your own personal tuning work, isn't that still artistic expression? A voice is different from an art style. Where is human expression being interrupted by automation? I can't make an explainer for those subjective concepts.


HMM Synthesis

AI vocal synths also include HMM (hidden Markov model) style singing synths, such as Sinsy! Sinsy uses the same principle as most text-to-speech of the past two decades. It's a little outdated now, with the advent of deep learning.
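
As a toy sketch of the HMM idea (made-up numbers, not Sinsy's real parameters): each phoneme is a state that emits acoustic frames, and self-transition probabilities decide how long it lasts before moving on to the next.

    import numpy as np

    rng = np.random.default_rng(0)

    states = ["s", "o", "l"]                         # phonemes, left to right
    self_loop = {"s": 0.7, "o": 0.9, "l": 0.8}       # chance of staying put each frame
    emission_mean = {"s": 0.2, "o": 1.0, "l": 0.5}   # stand-in acoustic feature per state

    frames = []
    for state in states:
        while True:
            frames.append((state, rng.normal(emission_mean[state], 0.05)))
            if rng.random() > self_loop[state]:      # leave the state
                break

    print(frames[:8])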