[Question] Example request. LocalAudioTransport + Whisper + llm + tts #197

Closed
gaceladri opened this issue May 31, 2024 · 6 comments

gaceladri commented May 31, 2024

Hi 👋

I am having trouble running a local example that integrates LocalAudioTransport, WhisperSTTService, ElevenLabsTTSService, and OpenAILLMService.

I have one script that runs Whisper locally for transcription and another that uses ElevenLabs and OpenAI for the TTS and LLM services, respectively. However, I am struggling to combine these components into a fully functional local conversation system.

To illustrate, here are the two examples I have working independently:

Example 1: Passing an LLM message to the TTS provider:

import asyncio
import os
import sys

import aiohttp
from loguru import logger
from pipecat.frames.frames import EndFrame, LLMMessagesFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.local.audio import LocalAudioTransport

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")


async def main():
    async with aiohttp.ClientSession() as session:
        transport = LocalAudioTransport(TransportParams(audio_out_enabled=True))

        tts = ElevenLabsTTSService(
            aiohttp_session=session,
            api_key=os.getenv("ELEVENLABS_API_KEY"),
            voice_id=os.getenv("ELEVENLABS_VOICE_ID"),
        )

        llm = OpenAILLMService(
            api_key=os.getenv("OPENAI_API_KEY"),
            model="gpt-3.5-turbo-0125",
        )

        messages = [
            {
                "role": "system",
                "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
            },
        ]

        pipeline = Pipeline([llm, tts, transport.output()])

        task = PipelineTask(pipeline)

        async def say_something():
            await asyncio.sleep(1)
            await task.queue_frames([LLMMessagesFrame(messages), EndFrame()])

        runner = PipelineRunner()

        await asyncio.gather(runner.run(task), say_something())


if __name__ == "__main__":
    asyncio.run(main())

Example 2: Using Whisper locally:

import asyncio
import sys

from loguru import logger
from pipecat.frames.frames import Frame, TranscriptionFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.services.whisper import Model, WhisperSTTService
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.local.audio import LocalAudioTransport

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")


class TranscriptionLogger(FrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        if isinstance(frame, TranscriptionFrame):
            print(f"Transcription: {frame.text}")


async def main():
    transport = LocalAudioTransport(TransportParams(audio_in_enabled=True))

    stt = WhisperSTTService()

    tl = TranscriptionLogger()

    pipeline = Pipeline([transport.input(), stt, tl])

    task = PipelineTask(pipeline)

    runner = PipelineRunner()

    await runner.run(task)


if __name__ == "__main__":
    asyncio.run(main())

Despite these individual successes, I'm unable to connect the transcriptions to the LLM and hold a continuous conversation. Could you provide or add an example of a fully working local setup that demonstrates how to achieve this?
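
Conceptually, I imagine the combined pipeline looking roughly like the sketch below (not working code, just what I'm trying to build; transport, stt, llm, tts and messages would be set up as in the two scripts above):

from pipecat.processors.aggregators.llm_response import (
    LLMAssistantResponseAggregator,
    LLMUserResponseAggregator,
)

# Aggregators that (as I understand it) turn user transcriptions into LLM
# context and keep the assistant's replies in the conversation history.
tma_in = LLMUserResponseAggregator(messages)
tma_out = LLMAssistantResponseAggregator(messages)

pipeline = Pipeline(
    [
        transport.input(),   # microphone audio in
        stt,                 # Whisper -> TranscriptionFrame
        tma_in,              # user transcriptions -> LLM messages
        llm,                 # OpenAI LLM
        tts,                 # ElevenLabs TTS
        transport.output(),  # speaker audio out
        tma_out,             # append assistant responses to the context
    ]
)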

Thank you!

ajram23 commented Jun 1, 2024

I got it running. DM me and I will give you the script.

gaceladri commented Jun 1, 2024

@ajram23 Can you paste it here? 🙏 I didn't know you could DM someone on GitHub!

ajram23 commented Jun 1, 2024

07-interruptible-local.py.txt
Here you go! Enjoy! cc @aconchillo just in case you want to add this to the examples folder.

gaceladri commented Jun 2, 2024

@ajram23 Thank you for your example! It's quite similar to what I have implemented. Were you able to interact with the LLM? In my case, I can see the initial message from the LLM, but I seem to have an issue with the communication between the Whisper service and the LLMUserResponseAggregator.

Here is my current code:

#
# Copyright (c) 2024, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#

import asyncio
import os
import sys

import aiohttp
from loguru import logger
from pipecat.frames.frames import Frame, LLMMessagesFrame, TranscriptionFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_response import (
    LLMAssistantResponseAggregator,
    LLMUserResponseAggregator,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.services.whisper import WhisperSTTService
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.local.audio import LocalAudioTransport
from pipecat.vad.silero import SileroVADAnalyzer
from pipecat.vad.vad_analyzer import VADParams

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")


class TranscriptionLogger(FrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        if isinstance(frame, TranscriptionFrame):
            logger.debug(f"Whisper transcription: {frame.text}")


async def main():
    async with aiohttp.ClientSession() as session:
        transport = LocalAudioTransport(
            TransportParams(
                audio_in_enabled=True,
                audio_out_enabled=True,
                transcription_enabled=True,
                vad_enabled=True,
                vad_analyzer=SileroVADAnalyzer(params=VADParams(min_volume=0.6)),
                vad_audio_passthrough=True,
            )
        )

        stt = WhisperSTTService(no_speech_prob=0.6)

        tts = ElevenLabsTTSService(
            aiohttp_session=session,
            api_key=os.getenv("ELEVENLABS_API_KEY"),
            voice_id="2ovNLFOsfyKPEWV5kqQi",
        )

        llm = OpenAILLMService(
            api_key=os.getenv("OPENAI_API_KEY"),
            model="gpt-3.5-turbo-0125",
        )

        messages = [
            {
                "role": "system",
                "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
            },
        ]

        tma_in = LLMUserResponseAggregator(messages)
        tma_out = LLMAssistantResponseAggregator(messages)

        pipeline = Pipeline(
            [
                transport.input(),  # Transport user input
                stt,  # STT
                tma_in,  # User responses
                llm,  # LLM
                tts,  # TTS
                transport.output(),  # Transport bot output
                tma_out,  # Assistant spoken responses
            ]
        )

        task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))

        runner = PipelineRunner()

        async def say_something():
            messages.append(
                {"role": "system", "content": "Please introduce yourself to the user."}
            )
            await task.queue_frames([LLMMessagesFrame(messages)])

        await asyncio.gather(runner.run(task), say_something())


if __name__ == "__main__":
    asyncio.run(main())

Here are my pipeline debug messages:

2024-06-02 09:25:37.876 | DEBUG    | pipecat.services.whisper:_load:67 - Loaded Whisper model
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking PipelineSource#0 -> LocalAudioInputTransport#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking LocalAudioInputTransport#0 -> WhisperSTTService#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking WhisperSTTService#0 -> LLMUserResponseAggregator#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking LLMUserResponseAggregator#0 -> OpenAILLMService#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking OpenAILLMService#0 -> ElevenLabsTTSService#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking ElevenLabsTTSService#0 -> LocalAudioOutputTransport#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking LocalAudioOutputTransport#0 -> LLMAssistantResponseAggregator#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking LLMAssistantResponseAggregator#0 -> PipelineSink#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.processors.frame_processor:link:37 - Linking Source#0 -> Pipeline#0
2024-06-02 09:25:37.939 | DEBUG    | pipecat.pipeline.runner:run:29 - Runner PipelineRunner#0 started running PipelineTask#0
2024-06-02 09:25:37.940 | DEBUG    | pipecat.services.openai:_stream_chat_completions:69 - Generating chat: [{"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.", "role": "system", "name": "system"}, {"content": "Please introduce yourself to the user.", "role": "system", "name": "system"}]
2024-06-02 09:25:38.694 | DEBUG    | pipecat.services.openai:_stream_chat_completions:96 - OpenAI LLM TTFB: 0.7542369365692139
2024-06-02 09:25:38.719 | DEBUG    | pipecat.services.elevenlabs:run_tts:35 - Transcribing text: [Hello!]
2024-06-02 09:25:39.323 | DEBUG    | pipecat.services.elevenlabs:run_tts:35 - Transcribing text: [I am your helpful Legal Language Model here to assist you during this WebRTC call.]
2024-06-02 09:25:40.177 | DEBUG    | pipecat.services.elevenlabs:run_tts:35 - Transcribing text: [How can I help you today?]
2024-06-02 09:25:42.710 | DEBUG    | pipecat.services.whisper:run_stt:86 - Whisper transcription:  Okay. 
2024-06-02 09:25:44.907 | DEBUG    | pipecat.services.whisper:run_stt:86 - Whisper transcription:  Okay. 
2024-06-02 09:25:47.174 | DEBUG    | pipecat.services.whisper:run_stt:86 - Whisper transcription:  you're 
2024-06-02 09:25:48.758 | DEBUG    | pipecat.services.whisper:run_stt:86 - Whisper transcription:  Okay. 
2024-06-02 09:25:51.009 | DEBUG    | pipecat.services.whisper:run_stt:86 - Whisper transcription:  I'm 
2024-06-02 09:25:53.277 | DEBUG    | pipecat.services.whisper:run_stt:86 - Whisper transcription:  Okay. 
2024-06-02 09:25:54.880 | DEBUG    | pipecat.services.whisper:run_stt:86 - Whisper transcription:  I did. 
2024-06-02 09:26:13.526 | DEBUG    | pipecat.services.whisper:run_stt:86 - Whisper transcription:  Hey.

For some reason, the transcriptions from Whisper are not being passed to the LLMUserResponseAggregator. I've added print statements inside the LLMUserResponseAggregator to check the messages, but nothing is logged after the Whisper model transcribes the speech.
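
To narrow down where the frames stop, I'm also thinking of dropping a pass-through logger between stt and tma_in, along these lines (just a sketch, not verified; I'm assuming process_frame should call super().process_frame and push_frame so frames keep flowing downstream):

class FrameTracer(FrameProcessor):
    """Log every frame that passes through this stage and forward it unchanged."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        logger.debug(f"FrameTracer: {frame} ({direction})")
        await self.push_frame(frame, direction)

# and in the pipeline:
# pipeline = Pipeline([transport.input(), stt, FrameTracer(), tma_in, llm, tts, transport.output(), tma_out])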

I'm running this on a Mac M2.

Any insights or suggestions on what might be going wrong would be greatly appreciated!

Thank you for your help!

ajram23 commented Jun 2, 2024

@gaceladri in my case I was able to; not sure what is going on with yours.

gaceladri commented

Ok, thank you for the support and feedback!
