
NAST-S2X: A Fast and End-to-End Simultaneous Speech-to-Any Translation Model


News🔥

  • (2024/06/27) We have created a tutorial to guide you through preprocessing the data and running NAST-S2X on your machine. You can find it at this URL.
  • We have published our paper on arXiv, available at https://arxiv.org/abs/2406.06937.
  • We have released the checkpoints and datasets for reference at Hugging Face🤗.

Features

  • 🤖 An end-to-end model without intermediate text decoding
  • 💪 Supports offline and streaming decoding of all modalities
  • ⚡️ 28× faster inference compared to autoregressive models

Examples

We present examples of French-to-English translation with chunk sizes of 320 ms and 2560 ms, as well as under offline conditions.

  • With chunk sizes of 320 ms and 2560 ms, the model starts generating the English translation before the source speech is complete.
  • In the examples of simultaneous interpretation, the left audio channel is the input streaming speech, and the right audio channel is the simultaneous translation.

Note

For a better experience, please wear headphones.

Chunk Size 320 ms: CS_320ms.mp4
Chunk Size 2560 ms: CS_2560ms.mp4
Offline: Offline.mp4

Source Speech Transcript: Avant la fusion des communes, Rouge-Thier faisait partie de la commune de Louveigné.
Reference Text Translation: before the fusion of the towns rouge thier was a part of the town of louveigne

Note

For more examples, please check https://nast-s2x.github.io/.

Performance

  • ⚡️ Lightning Fast: 28× faster inference and competitive quality in offline speech-to-speech translation
  • 👩‍💼 Simultaneous: Achieves high-quality simultaneous interpretation within a delay of less than 3 seconds
  • 🤖 Unified Framework: Supports end-to-end text and speech generation in one model

Check Details 👇

[Figures: detailed results for Offline-S2S, Simul-S2S, and Simul-S2T]

Architecture

  • Fully Non-autoregressive: Trained with CTC-based non-monotonic latent alignment loss (Shao and Feng, 2022) and glancing mechanism (Qian et al., 2021).
  • Minimal Human Design: Seamlessly switch between offline translation and simultaneous interpretation by adjusting the chunk size (see the sketch below).
  • End-to-End: Generate target speech without target text decoding.
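
To make the "switch by chunk size" idea concrete, here is a minimal Python sketch (not the released implementation): `model`, `model.step`, and `model.finish` are hypothetical stand-ins for a chunked encoder-decoder interface, and only the chunk size distinguishes the two modes.

```python
import numpy as np

SAMPLE_RATE = 16000

def chunked_translate(model, waveform: np.ndarray, chunk_ms: int):
    """Feed the source speech chunk by chunk and collect partial outputs.

    A small chunk_ms (e.g. 320) yields simultaneous interpretation; a chunk_ms
    at least as long as the utterance reduces to offline decoding.
    """
    chunk_len = int(SAMPLE_RATE * chunk_ms / 1000)
    outputs = []
    for start in range(0, len(waveform), chunk_len):
        chunk = waveform[start:start + chunk_len]
        # Hypothetical step: the model consumes the new chunk and
        # non-autoregressively emits whatever target text/speech units
        # it is ready to commit to.
        outputs.append(model.step(chunk))
    # Flush any remaining output once the source speech has ended.
    outputs.append(model.finish())
    return outputs
```

With chunk_ms set to cover the full utterance, the loop runs once and reduces to offline decoding, which is the sense in which one model covers both settings.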

Sources and Usage

Model

Note

We release French-to-English speech-to-speech translation models trained on the CVSS-C dataset to reproduce results in our paper. You can train models in your desired languages by following the instructions provided below.

🤗 Model card

Chunk Size | Checkpoint | ASR-BLEU | ASR-BLEU (Silence Removed) | Average Lagging
320 ms | checkpoint | 19.67 | 24.90 | -393 ms
1280 ms | checkpoint | 20.20 | 25.71 | 3330 ms
2560 ms | checkpoint | 24.88 | 26.14 | 4976 ms
Offline | checkpoint | 25.82 | - | -

Vocoder: checkpoint
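
The checkpoints can also be fetched programmatically from Hugging Face. The snippet below is only a sketch: the repository ID and file names are placeholders that should be replaced with the ones listed on the model card above.

```python
from huggingface_hub import hf_hub_download

# Placeholder repo_id and filenames; substitute the values from the model card.
ckpt_path = hf_hub_download(
    repo_id="ICTNLP/NAST-S2X",        # placeholder repository id
    filename="checkpoint_320ms.pt",   # placeholder checkpoint name
)
vocoder_path = hf_hub_download(
    repo_id="ICTNLP/NAST-S2X",        # placeholder repository id
    filename="vocoder.pt",            # placeholder vocoder name
)
print(ckpt_path, vocoder_path)
```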

Inference

Warning

Before running any of the provided shell scripts, make sure to replace the path variables in each file with the paths specific to your machine.

Offline Inference

Simultaneous Inference

  • We use our customized fork of SimulEval (commit b43a7c) to evaluate the model in simultaneous inference. This fork is built upon the official SimulEval (commit a1435b) and includes additional latency scorers.
  • Data preprocessing: Follow the instructions in the document.
  • Streaming Generation and Evaluation: Execute streaming_infer.sh (a hedged sketch of such an invocation is shown after this list).
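
As a rough illustration of what streaming_infer.sh does, the Python sketch below assembles a SimulEval evaluation command. All paths are placeholders, the flags shown follow the upstream SimulEval CLI and may differ in the customized fork, and the released streaming_infer.sh remains the authoritative entry point.

```python
import subprocess

# Placeholder paths; the exact flags are those of the upstream SimulEval CLI
# and may differ in the customized fork used by NAST-S2X.
cmd = [
    "simuleval",
    "--agent", "path/to/nast_s2x_agent.py",   # placeholder agent script
    "--source", "path/to/test.source",        # placeholder source manifest
    "--target", "path/to/test.target",        # placeholder references
    "--output", "path/to/output_dir",         # placeholder output directory
]
subprocess.run(cmd, check=True)
```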

Train your own NAST-S2X

Citing

Please kindly cite us if you find our papers or code useful.

@inproceedings{ma2024nonautoregressive,
  title={A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation},
  author={Ma, Zhengrui and Fang, Qingkai and Zhang, Shaolei and Guo, Shoutao and Feng, Yang and Zhang, Min},
  booktitle={Proceedings of ACL 2024},
  year={2024},
}

@inproceedings{fang2024ctcs2ut,
  title={CTC-based Non-autoregressive Textless Speech-to-Speech Translation},
  author={Fang, Qingkai and Ma, Zhengrui and Zhou, Yan and Zhang, Min and Feng, Yang},
  booktitle={Findings of ACL 2024},
  year={2024},
}