-T.K.- Lab Notes
  • Home
  • Convention Used
  • STM32
    • Getting Started - STM32 Edition
      • Setting up STM32CubeIDE
      • Going Through A Starter Project
      • Changing STM32CubeIDE Settings
      • Pinout Quick Reference
    • Misc
      • Using Nucleo STLink to Flash Off-board Chips
      • Changing STM32 Default Boot Option
      • STM32 Flash Option Byte Recovery
      • STM32 Systick and Timeout in Interrupt Routines
      • Telesky ST-Link V2 Upgrade Firmware
      • Some Performance Measurements on STM32 MCUs
    • System Core
      • Using GPIO on STM32
      • Setting up External Interrupt on STM32
    • Analog
      • Using ADC on STM32
      • ADC Reading Sequence with DMA on STM32
      • Using OPAMP on STM32
      • Using DAC on STM32
    • Timers
      • Using RTC on STM32
      • Using TIM on STM32
    • Connectivity
      • UART
      • USART
        • USART - CAN Dongle (Fixed Size Serializer with Robust Timeout Handling)
      • CAN
      • FDCAN
      • I2C
      • SPI
        • SPI - GC9A01A LCD Screen
        • SPI - RFID
        • SPI - SD Card
      • Ethernet
        • Ethernet - LWIP
        • Ethernet - UDP
        • Ethernet - UDP Multicast
      • USB - FS
      • USB - HS
    • Middleware
      • FreeRTOS
    • Software Pack
      • STMicroelectronics.X-CUBE-AI - Sine Approximator
  • RISC-V / SoC
    • RISC-V: Baremetal From The Ground Up (Chipyard Edition)
    • Quick Start With Chipyard on Ubuntu or WSL
    • Other Chipyard Stuff
      • Debugging OsciArty with JTAG and command line GDB
      • Debugging BearlyML with JTAG and GDB
      • Booting BearlyML With External SPI Flash
      • Setting Up SD / microSD Card for vcu118 Linux Image
      • More Chipyard Stuff
    • A Minimal Chisel Development Environment with Mill
    • Vivado Stuff
      • Installing Xilinx Vivado on Ubuntu 22.04 / 24.04
      • Arty 35T / 100T UART Pins
      • Configuring Vivado DDR MIG on Arty 35T
      • Configuring Vivado DDR MIG on Nexys Video
      • Vivado Generate Flash Config .mcs File From Bitstream
      • Vivado TCL Scripts
    • Adding Custom Instructions to RISC-V GCC Toolchain
    • Kendryte K230 Bringup
      • K230 EVB Board Resource Overview
    • Intel FPGA Quartus
    • Setting up RISC-V Toolchain on Ubuntu 24.04/22.04
    • Getting Started with Zephyr
      • Getting Start with Zephyr on RISC-V System - Windows
      • Getting Started with Zephyr on RISC-V - Ubuntu
    • C Library Compile Magic
    • Setting up ExecuTorch on Ubuntu 22.04
      • Executorch on ARM
  • Motor Control
    • Recoil FOC Motor Controller
      • 0x00. Theory of Operation
      • 0x01. Components
      • 0x02. Implementation
      • 0x03. Application
    • Recoil Documentation
    • New Controller Board Soldering & Power-on Checklist
    • MJBOTS Moteus setup
    • Failed Attempt on Acceleration- and Velocity-Limited Trajectory Generation
    • Moteus Code Analyze
    • MIT Motor Controller Code Analyze
    • ODrive Setup
    • Setting up Recoil USB-CAN Adapter
      • Setting up Recoil USB-CAN Adapter - Ubuntu
      • Setting up Recoil USB-CAN Adapter - Windows
    • NTC Temperature Sense Resistor Value Calculation
  • ML/RL
    • Setting up NVIDIA Tools
      • Setting up NVIDIA Driver on Ubuntu 22.04 / 20.04
      • Getting Started with NVIDIA Isaac Lab on Ubuntu 22.04 / 24.04
      • Setting up Omniverse on Ubuntu 24.04 (2025 Ver)
      • Creating Custom Training Environment in IsaacLab via Extensions
      • NVIDIA Isaac Gym URDF Import Notes
      • Setting up TensorRT Environment on Ubuntu 22.04 / 20.04
      • Setting up NVIDIA Omniverse Isaac Sim on Ubuntu 22.04 / 20.04
      • Setting up NVIDIA Nsight System and Nsight Compute on Ubuntu 24.04
      • Getting Started with Jetson AGX Orin
        • Getting Started with Jetson Using SDK Manager on Ubuntu 22.04
        • Using Jetson AGX Orin with Provided Ubuntu 20.04 System
        • Setting up Common Software on Jetson AGX Orin
        • Solving USB-CAN and USB CH340 Driver Issue on reComputer Mini J4012
        • [Deprecated] Upgrading Jetson AGX Orin to Ubuntu 22.04
      • Solving Torch Errors
      • [Deprecated] Setting up NVIDIA Isaac Gym on Ubuntu 22.04 / 20.04
    • RL Frameworks
      • Case Study: A Dive Into LeggedGym and RSL-RL Framework
      • Case Study: A Dive Into IsaacLab
      • Getting Started with Mujoco
      • Case Study: A Dive Into Unitree-Mujoco
      • Case Study: Setting up Berkeley Humanoid
      • Case Study: Looking into robot_lab
      • Case Study: Setting up RL-SAR
      • Case Study: Getting Started with LeRobot
      • Case Study: No-Mercy Project
        • Python Mouse and Keyboard Interaction in Game Environment
        • Detecting Phara
      • OpenAI gym + Mujoco Setup
      • Gazebo Setup
    • ROS
      • Setting up ROS on Ubuntu 20.04
      • Setting up ETH ANYbotics/elevation_mapping on Ubuntu 20.04
    • ROS 2
      • Setting up ROS 2 Humble Hawksbill on Ubuntu
      • Setting up ROS 2 Humble Hawksbill on Windows 10
      • ROS 2 Issue in Ubuntu with conda
    • Google Colab
      • Colab Resource Options
      • so-vits-svc 4.0: Colab Flow
    • URDF to MJCF Mujoco Notes
    • OnShape to URDF
    • Audio Stuff
      • Microsoft TTS
      • GPTSoVITS
      • 深入浅出理解 So-VITS-SVC 原理
      • NAI-SVC Experiment Log
      • Setting up ChatTTS on Ubuntu 22.04
    • Setting up AnythingLLM on Ubuntu 22.04
    • Setting up MineDojo Environment
    • Processing the SFU Motion Capture Dataset
    • Torch Profiling
    • Setting up Unitree A1
  • 3D Modeling
    • 3D Print Tolerancing
    • Blender to OnShape Workflow
    • Onshape to Blender Workflow
    • Setting up FBX Plugin for Python on Ubuntu 22.04
    • Install Blender on Ubuntu 22.04
    • Blender Python Related
    • VRoid, MMD, Blender Workflow
  • Tools
    • Windows
      • Install WSL 2
      • Install Make on Windows
      • Remove EFI disk partition
      • SAI Color Flip/Color Inversion
      • Microsoft Visual Studio Create Software Signature
      • Connecting the SIGLENT SDS1104X-U Oscilloscope to Computer
      • Using JADENS Thermal Label Printer
      • Getting Started with XBee (ZigBee)
    • Ubuntu
      • Ubuntu 22.04 Standard Installation Procedure
      • Protobuf
      • Setting up Docker on Ubuntu 22.04
      • Linux Mounting SD Card
      • Partitioning SD card
      • Windows Ubuntu Dual Boot Issues
      • Check Disk / Folder / File Size
      • Test Disk Read/Write Speed
      • Cannot Start Chrome in Ubuntu 22.04 After Changing Network Settings
      • Configure USB Access Permissions (udev rules) on Ubuntu
      • Screen Commands
      • Disabling the "<Application> is not responding." System Message on Ubuntu
      • Install and Configure GlobalProtect UC Berkeley VPN Service on Ubuntu 22.04
      • Solving Gamepad not Detected on Ubuntu 22.04
      • Using 3DConnexion Mouse on Ubuntu with Python
      • Install Cursor the AI Editor on Ubuntu 22.04/24.04
      • Solving the .nfsXXX file cannot be deleted issue
      • Windows Remote Desktop Issues
      • nsswitch.conf
    • Lab Automation
    • Github-Related Info
    • Python
      • Publish Python Package to PyPi
      • Python Logging Utility
      • Python converting bettwen JSON and XML
      • Retrieve Github user avatar with Github API
      • Jupyter Notebook Error
    • Raspberry Pi Setup
    • Clang-Format Style Config
    • CrazyFlie Setting Up
    • Using Oscilloscope: x1 vs x10
    • Using the BWRC 3D Printer
    • Using the Leica Microscope at BWRC
    • Pair XBoxController to Raspberry Pi with Bluetooth
    • Reading FrSky Transmitter SBUS data with STM32
    • Configuring the FrSky TARANIS X9D Plus 2019 RC Controller
    • Applying Notion for Education
    • Gitbook Errata
    • Setting up SteamVR without HMD
    • CMake Best Practices
    • Adobe Premiere Pro Audio Level Settings
  • Mechanical
    • MAD Cycloidal Actuator
    • Dog Stuff
      • Fixing the Unitree A1 Robot Dog Leg Motor
      • Fixing the Unitree A1 Robot Dog Ethernet Port
      • Fixing MIT Mini Cheetah
      • Fixing the Unitree Go1 Robot Dog Ethernet Port
    • 3D Printer Profile
  • Electrical
    • A Note on the Polarity of the Famous TT Motor
    • Wiring Pinmap Convention
    • MCU Pinmap Convention
    • PCB Design and Manufacturing Conventions
    • ESP32 Cam
    • LiPo Safety
    • AS5600 Modification
    • OpenOCD and FTDI Chips
    • FT-LINK FTDI Debugger Design Considerations
    • A Study on Reset Pin Connection
    • Note on CAN Termination Resistor
  • UW
    • Digital-Twin Communication System
    • Unreal Engine Communicate with SteamVR
    • Unreal Engine Socket Communication
    • A Note on Coordinate Systems
    • NewLine Serialization Method
    • Humanoid Design Notes
      • Robot Body Ratio Issue
      • VRM Parameters
      • Note on Face Design and Manufacture
  • Workflow Automation
    • RISC-V Toolbox Website
    • Zigbee-Based Home Automation
      • Setting up Home Assistant on Raspberry Pi to Control Zigbee IoT Devices
      • Update Sonoff Zigbee 3.0 USB Dongle Plus (CC2652P)
  • Finance
    • Finance
    • UC Berkeley Reimbursement
  • Life
    • Some Interview Questions
    • Health Insurance
Powered by GitBook
On this page
  • Reference
  • Colab Notebook
  • 1. Preparing Training Data
  • Dataset Requirements
  • Training Data
  • Data Preparation
  • 2. Setting up the training environment

Was this helpful?

  1. ML/RL
  2. Google Colab

so-vits-svc 4.0: Colab Flow

Last updated 1 year ago

Was this helpful?

Reference

【so-vits-svc】手把手教你老婆唱歌

soVITS3.0数据集准备

Colab Notebook

1. Preparing Training Data

Dataset Requirements

  • 60-100 slices of audio, in .wav format

  • each slice should be around 4-8 seconds

  • Sample rate should be 44100 Hz

Training Data

This radio story series of Majo no Tabitabi is a decent training dataset:

Data Preparation

We need to assemble the audio pieces together with Adobe Pr Pro

At the beginning and end of each episode, the volume of the background music is raised. For the best quality, we will trim those sections out.

Also, we can observe that the BGM volume never raise above -30 dB. We will since use this value as the noise floor later on.

Audio export settings:

import numpy as np


# This function is obtained from librosa.
def get_rms(
    y,
    *,
    frame_length=2048,
    hop_length=512,
    pad_mode="constant",
):
    padding = (int(frame_length // 2), int(frame_length // 2))
    y = np.pad(y, padding, mode=pad_mode)

    axis = -1
    # put our new within-frame axis at the end for now
    out_strides = y.strides + tuple([y.strides[axis]])
    # Reduce the shape on the framing axis
    x_shape_trimmed = list(y.shape)
    x_shape_trimmed[axis] -= frame_length - 1
    out_shape = tuple(x_shape_trimmed) + tuple([frame_length])
    xw = np.lib.stride_tricks.as_strided(
        y, shape=out_shape, strides=out_strides
    )
    if axis < 0:
        target_axis = axis - 1
    else:
        target_axis = axis + 1
    xw = np.moveaxis(xw, -1, target_axis)
    # Downsample along the target axis
    slices = [slice(None)] * xw.ndim
    slices[axis] = slice(0, None, hop_length)
    x = xw[tuple(slices)]

    # Calculate power
    power = np.mean(np.abs(x) ** 2, axis=-2, keepdims=True)

    return np.sqrt(power)


class Slicer:
    def __init__(self,
                 sr: int,
                 threshold: float = -40.,
                 min_length: int = 5000,
                 min_interval: int = 300,
                 hop_size: int = 20,
                 max_sil_kept: int = 5000):
        if not min_length >= min_interval >= hop_size:
            raise ValueError('The following condition must be satisfied: min_length >= min_interval >= hop_size')
        if not max_sil_kept >= hop_size:
            raise ValueError('The following condition must be satisfied: max_sil_kept >= hop_size')
        min_interval = sr * min_interval / 1000
        self.threshold = 10 ** (threshold / 20.)
        self.hop_size = round(sr * hop_size / 1000)
        self.win_size = min(round(min_interval), 4 * self.hop_size)
        self.min_length = round(sr * min_length / 1000 / self.hop_size)
        self.min_interval = round(min_interval / self.hop_size)
        self.max_sil_kept = round(sr * max_sil_kept / 1000 / self.hop_size)

    def _apply_slice(self, waveform, begin, end):
        if len(waveform.shape) > 1:
            return waveform[:, begin * self.hop_size: min(waveform.shape[1], end * self.hop_size)]
        else:
            return waveform[begin * self.hop_size: min(waveform.shape[0], end * self.hop_size)]

    # @timeit
    def slice(self, waveform):
        if len(waveform.shape) > 1:
            samples = waveform.mean(axis=0)
        else:
            samples = waveform
        if samples.shape[0] <= self.min_length:
            return [waveform]
        rms_list = get_rms(y=samples, frame_length=self.win_size, hop_length=self.hop_size).squeeze(0)
        sil_tags = []
        silence_start = None
        clip_start = 0
        for i, rms in enumerate(rms_list):
            # Keep looping while frame is silent.
            if rms < self.threshold:
                # Record start of silent frames.
                if silence_start is None:
                    silence_start = i
                continue
            # Keep looping while frame is not silent and silence start has not been recorded.
            if silence_start is None:
                continue
            # Clear recorded silence start if interval is not enough or clip is too short
            is_leading_silence = silence_start == 0 and i > self.max_sil_kept
            need_slice_middle = i - silence_start >= self.min_interval and i - clip_start >= self.min_length
            if not is_leading_silence and not need_slice_middle:
                silence_start = None
                continue
            # Need slicing. Record the range of silent frames to be removed.
            if i - silence_start <= self.max_sil_kept:
                pos = rms_list[silence_start: i + 1].argmin() + silence_start
                if silence_start == 0:
                    sil_tags.append((0, pos))
                else:
                    sil_tags.append((pos, pos))
                clip_start = pos
            elif i - silence_start <= self.max_sil_kept * 2:
                pos = rms_list[i - self.max_sil_kept: silence_start + self.max_sil_kept + 1].argmin()
                pos += i - self.max_sil_kept
                pos_l = rms_list[silence_start: silence_start + self.max_sil_kept + 1].argmin() + silence_start
                pos_r = rms_list[i - self.max_sil_kept: i + 1].argmin() + i - self.max_sil_kept
                if silence_start == 0:
                    sil_tags.append((0, pos_r))
                    clip_start = pos_r
                else:
                    sil_tags.append((min(pos_l, pos), max(pos_r, pos)))
                    clip_start = max(pos_r, pos)
            else:
                pos_l = rms_list[silence_start: silence_start + self.max_sil_kept + 1].argmin() + silence_start
                pos_r = rms_list[i - self.max_sil_kept: i + 1].argmin() + i - self.max_sil_kept
                if silence_start == 0:
                    sil_tags.append((0, pos_r))
                else:
                    sil_tags.append((pos_l, pos_r))
                clip_start = pos_r
            silence_start = None
        # Deal with trailing silence.
        total_frames = rms_list.shape[0]
        if silence_start is not None and total_frames - silence_start >= self.min_interval:
            silence_end = min(total_frames, silence_start + self.max_sil_kept)
            pos = rms_list[silence_start: silence_end + 1].argmin() + silence_start
            sil_tags.append((pos, total_frames + 1))
        # Apply and return slices.
        if len(sil_tags) == 0:
            return [waveform]
        else:
            chunks = []
            if sil_tags[0][0] > 0:
                chunks.append(self._apply_slice(waveform, 0, sil_tags[0][0]))
            for i in range(len(sil_tags) - 1):
                chunks.append(self._apply_slice(waveform, sil_tags[i][1], sil_tags[i + 1][0]))
            if sil_tags[-1][1] < total_frames:
                chunks.append(self._apply_slice(waveform, sil_tags[-1][1], total_frames))
            return chunks


def main():
    import os
    import shutil
    import librosa
    import soundfile
    
    IN_FILENAME = "Elaina_TrainingSequence_1hr.wav"
    OUT_FILE_DIR = "Elaina"

    print("creating directory...")
    shutil.rmtree(OUT_FILE_DIR, ignore_errors=True)
    os.makedirs(OUT_FILE_DIR)

    print("loading audio...")
    audio, sr = librosa.load(IN_FILENAME, sr=None, mono=False)
    slicer = Slicer(
        sr=sr,
        
        # The RMS threshold presented in dB. Areas where all RMS values are below this threshold
        # will be regarded as silence. Increase this value if your audio is noisy. 
        threshold=-30,

        # The minimum length required for each sliced audio clip, presented in milliseconds.
        min_length=4000,

        # The minimum length for a silence part to be sliced, presented in milliseconds. Set
        # this value smaller if your audio contains only short breaks. The smaller this value is,
        # the more sliced audio clips this script is likely to generate. Note that this value must
        # be smaller than min_length and larger than hop_size. 
        min_interval=500,

        # Length of each RMS frame, presented in milliseconds. Increasing this value will increase
        # the precision of slicing, but will slow down the process.
        hop_size=5,

        # The maximum silence length kept around the sliced audio, presented in milliseconds. Adjust
        # this value according to your needs. Note that setting this value does not mean that
        # silence parts in the sliced audio have exactly the given length. The algorithm will search
        # for the best position to slice, as described above. 
        max_sil_kept=500
    )
    print("slicing...")
    chunks = slicer.slice(audio)

    print("writing results...")
    for i, chunk in enumerate(chunks):
        if len(chunk.shape) > 1:
            # Swap axes if the audio is stereo.
            chunk = chunk.T
        # Save sliced audio files with soundfile.
        soundfile.write(os.path.join(OUT_FILE_DIR, f"seq_{i}.wav"), chunk, sr)

if __name__ == "__main__":
    main()

Open a few audio files to verify that the audio has been sliced correctly.

We can also enable the "Length" property in the file browser to get a sense of the duration of each slice. To do so, select Sort by -> More... and click Length in the right-click menu.

Finally, zip the dataset.

2. Setting up the training environment

To download the audio, use .

Then, we need to slice the data into pieces. To do that, we can use .

https://www.bilibili.com/video/BV1d54y1m7cK/
JiJiDown
openvpi/audio-slicer
https://www.bilibili.com/video/BV1vM4y1S7zB/
https://www.bilibili.com/read/cv20514221
https://github.com/innnky/so-vits-svc/tree/4.0
Google Colaboratory
Logo