
NAST-S2X: A Fast and End-to-End Simultaneous Speech-to-Any Translation Model


News🔥

  • (2024/06/27) We have created a tutorial to guide you through preprocessing the data and running NAST-S2X on your machine. You can find it at this URL.
  • We have published our paper on arXiv, available at https://arxiv.org/abs/2406.06937.
  • We have released the checkpoints and datasets for reference at Hugging Face🤗.

Features

  • 🤖 An end-to-end model without intermediate text decoding
  • 💪 Supports offline and streaming decoding of all modalities
  • ⚡️ 28× faster inference compared to autoregressive models

Examples

We present examples of French-to-English translation with chunk sizes of 320 ms and 2560 ms, as well as under offline conditions.

  • With chunk sizes of 320 ms and 2560 ms, the model starts generating the English translation before the source speech is complete.
  • In the examples of simultaneous interpretation, the left audio channel is the input streaming speech, and the right audio channel is the simultaneous translation.

Note

For a better experience, please wear headphones.

Chunk Size 320 ms: CS_320ms.mp4
Chunk Size 2560 ms: CS_2560ms.mp4
Offline: Offline.mp4

Source Speech Transcript: Avant la fusion des communes, Rouge-Thier faisait partie de la commune de Louveigné.
Reference Text Translation: before the fusion of the towns rouge thier was a part of the town of louveigne

Note

For more examples, please check https://nast-s2x.github.io/.

Performance

  • ⚡️ Lightning Fast: 28× faster inference and competitive quality in offline speech-to-speech translation
  • 👩‍💼 Simultaneous: Achieves high-quality simultaneous interpretation within a delay of less than 3 seconds
  • 🤖 Unified Framework: Supports end-to-end text and speech generation in one model

Check Details 👇

[Figures: detailed results for Offline-S2S, Simul-S2S, and Simul-S2T]

Architecture

  • Fully Non-autoregressive: Trained with CTC-based non-monotonic latent alignment loss (Shao and Feng, 2022) and glancing mechanism (Qian et al., 2021).
  • Minimal Human Design: Seamlessly switch between offline translation and simultaneous interpretation by adjusting the chunk size (see the sketch below).
  • End-to-End: Generate target speech without target text decoding.
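
To make the "switch by chunk size" idea concrete, here is a minimal Python sketch (not the released implementation): `model`, `model.step`, and `model.finish` are hypothetical stand-ins for a chunked encoder-decoder interface, and only the chunk size distinguishes the two modes.

```python
import numpy as np

SAMPLE_RATE = 16000

def chunked_translate(model, waveform: np.ndarray, chunk_ms: int):
    """Feed the source speech chunk by chunk and collect partial outputs.

    A small chunk_ms (e.g. 320) yields simultaneous interpretation; a chunk_ms
    at least as long as the utterance reduces to offline decoding.
    """
    chunk_len = int(SAMPLE_RATE * chunk_ms / 1000)
    outputs = []
    for start in range(0, len(waveform), chunk_len):
        chunk = waveform[start:start + chunk_len]
        # Hypothetical step: the model consumes the new chunk and
        # non-autoregressively emits whatever target text/speech units
        # it is ready to commit to.
        outputs.append(model.step(chunk))
    # Flush any remaining output once the source speech has ended.
    outputs.append(model.finish())
    return outputs
```

With chunk_ms set to cover the full utterance, the loop runs once and reduces to offline decoding, which is the sense in which one model covers both settings.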

Sources and Usage

Model

Note

We release French-to-English speech-to-speech translation models trained on the CVSS-C dataset to reproduce results in our paper. You can train models in your desired languages by following the instructions provided below.

🤗 Model card

Chunk Size | Checkpoint | ASR-BLEU | ASR-BLEU (Silence Removed) | Average Lagging
320 ms | checkpoint | 19.67 | 24.90 | -393 ms
1280 ms | checkpoint | 20.20 | 25.71 | 3330 ms
2560 ms | checkpoint | 24.88 | 26.14 | 4976 ms
Offline | checkpoint | 25.82 | - | -

Vocoder: checkpoint
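
The checkpoints can also be fetched programmatically from Hugging Face. The snippet below is only a sketch: the repository ID and file names are placeholders that should be replaced with the ones listed on the model card above.

```python
from huggingface_hub import hf_hub_download

# Placeholder repo_id and filenames; substitute the values from the model card.
ckpt_path = hf_hub_download(
    repo_id="ICTNLP/NAST-S2X",        # placeholder repository id
    filename="checkpoint_320ms.pt",   # placeholder checkpoint name
)
vocoder_path = hf_hub_download(
    repo_id="ICTNLP/NAST-S2X",        # placeholder repository id
    filename="vocoder.pt",            # placeholder vocoder name
)
print(ckpt_path, vocoder_path)
```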

Inference

Warning

Before running any of the provided shell scripts, make sure to replace the path variables in each file with the paths specific to your machine.

Offline Inference

Simultaneous Inference

  • We use our customized fork of SimulEval (commit b43a7c) to evaluate the model in simultaneous inference. This fork is built upon the official SimulEval (commit a1435b) and includes additional latency scorers.
  • Data preprocessing: Follow the instructions in the document.
  • Streaming Generation and Evaluation: Execute streaming_infer.sh (a hedged sketch of such an invocation is shown after this list).
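
As a rough illustration of what streaming_infer.sh does, the Python sketch below assembles a SimulEval evaluation command. All paths are placeholders, the flags shown follow the upstream SimulEval CLI and may differ in the customized fork, and the released streaming_infer.sh remains the authoritative entry point.

```python
import subprocess

# Placeholder paths; the exact flags are those of the upstream SimulEval CLI
# and may differ in the customized fork used by NAST-S2X.
cmd = [
    "simuleval",
    "--agent", "path/to/nast_s2x_agent.py",   # placeholder agent script
    "--source", "path/to/test.source",        # placeholder source manifest
    "--target", "path/to/test.target",        # placeholder references
    "--output", "path/to/output_dir",         # placeholder output directory
]
subprocess.run(cmd, check=True)
```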

Train your own NAST-S2X

Citing

Please kindly cite us if you find our papers or code useful.

@inproceedings{ma2024nonautoregressive,
  title={A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation},
  author={Ma, Zhengrui and Fang, Qingkai and Zhang, Shaolei and Guo, Shoutao and Feng, Yang and Zhang, Min},
  booktitle={Proceedings of ACL 2024},
  year={2024},
}

@inproceedings{fang2024ctcs2ut,
  title={CTC-based Non-autoregressive Textless Speech-to-Speech Translation},
  author={Fang, Qingkai and Ma, Zhengrui and Zhou, Yan and Zhang, Min and Feng, Yang},
  booktitle={Findings of ACL 2024},
  year={2024},
}