[Notes] Speech Recognition and Speech Synthesis in Python with Baidu AI Cloud

Published 2023-07-31, updated 2026-04-06

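Preface

These notes show how to do speech recognition (speech to text) and speech synthesis (text to speech) in Python using Baidu AI Cloud.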

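Getting Baidu AI Cloud credentials

Log in to Baidu AI Cloud and complete real-name verification

Log in to Baidu AI Cloud -> search for "语音" (speech) -> open the management console

Tick "I have read and agree"

Real-name verification is required on first use

Free trial for 180 days

Claim the free resources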

Speech recognition -> select all -> claim for 0 yuan

Speech synthesis -> select all -> claim for 0 yuan

Go to the application list

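Create an application and get its keys

Application list -> Create application

Fill in the basic information -> Create now

View the application details

Note down the AppID, API Key, and Secret Key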
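The three values obtained above do not have to be pasted into the source file. A minimal sketch, assuming they are exported as environment variables first; the variable names `BAIDU_APP_ID`, `BAIDU_API_KEY`, and `BAIDU_SECRET_KEY` and the helper `load_baidu_credentials` are hypothetical and not part of Baidu's SDK:

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
ENV_NAMES = ("BAIDU_APP_ID", "BAIDU_API_KEY", "BAIDU_SECRET_KEY")


def load_baidu_credentials(env=None):
    """Return (app_id, api_key, secret_key) read from the environment."""
    env = os.environ if env is None else env
    # Collect every missing or empty variable so the error names all of them.
    missing = [name for name in ENV_NAMES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in ENV_NAMES)
```

With this in place, `APP_ID, API_KEY, SECRET_KEY = load_baidu_credentials()` could stand in for the hard-coded placeholders used later in these notes, which keeps the keys out of version control.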

Installing dependencies

System microphone access

brew install portaudio
pip3 install pyaudio

Baidu API

pip3 install baidu-aip

Speech recognition

pip3 install SpeechRecognition

Audio playback

On macOS, audio playback goes through Objective-C, so PyObjC is needed

pip3 install pyobjc

Playback library

pip3 install playsound

Pitfall

Installing playsound with pip fails with: Getting requirements to build wheel ... error error: subprocess-exited-with-error

Fix

Upgrade wheel, then install playsound again

pip3 install --upgrade wheel
pip3 install playsound
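After the installs above, a quick check that every package can actually be found may save some debugging later. The sketch below uses only the standard library; the helper name `check_modules` is made up for illustration. Note that import names differ from pip names: `aip` comes from baidu-aip, `speech_recognition` from SpeechRecognition, and `objc` from pyobjc (macOS only).

```python
import importlib.util

# Import names (not pip package names) for the packages installed above.
MODULES = ("pyaudio", "aip", "speech_recognition", "playsound", "objc")


def check_modules(names=MODULES):
    """Map each top-level module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

`find_spec` only locates the module without importing it, so the check stays fast and has no side effects.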

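Speech recognition

Record speech into an audio file and convert the speech to text

<APP_ID>: the AppID from Baidu AI Cloud
<API_KEY>: the API Key from Baidu AI Cloud
<SECRET_KEY>: the Secret Key from Baidu AI Cloud

timeout=: how long, in seconds, to wait for speech to start before giving up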
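The recognition code in this section records at 16000 Hz and passes `"wav"` and `16000` to `client.asr`. When recognizing a file that was recorded elsewhere, it can help to confirm the file matches before uploading it. A minimal sketch using the standard `wave` module; the function name `check_wav_format` is hypothetical, and the mono and 16-bit checks are my assumption about what the service expects, so confirm them against Baidu's documentation:

```python
import wave


def check_wav_format(path, expected_rate=16000):
    """Return a list of problems found in the WAV header (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {wav.getframerate()}, expected {expected_rate}"
            )
        # Assumption: the service expects mono, 16-bit PCM audio.
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getsampwidth() != 2:
            problems.append(f"{wav.getsampwidth() * 8}-bit samples, expected 16-bit")
    return problems
```

An empty list means the header matches; otherwise each entry describes one mismatch that could explain a failed or poor recognition result.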

from aip import AipSpeech
import speech_recognition as SpeechRecognition

APP_ID = "<APP_ID>"
API_KEY = "<API_KEY>"
SECRET_KEY = "<SECRET_KEY>"
client = AipSpeech(APP_ID, API_KEY, SECRET_KEY)
sr = SpeechRecognition.Recognizer()
# File the recorded audio is saved to
debug_file_name = "./debug.wav"


# Record audio from the microphone and save it to a file
def _record(if_cmu: bool = False, rate=16000):
    # Capture sound
    with SpeechRecognition.Microphone(sample_rate=rate) as source:
        # Calibrate for ambient noise
        sr.adjust_for_ambient_noise(source, duration=1)
        # Wait for speech
        print("Waiting for audio input")
        audio = sr.listen(source, timeout=15, phrase_time_limit=2)
    # Save to a file
    with open(debug_file_name, "wb") as f:
        f.write(audio.get_wav_data())

    if if_cmu:
        return audio
    else:
        return _get_file_content(debug_file_name)


# Load audio from a local file as input for Baidu speech recognition
def _get_file_content(file_name):
    with open(file_name, "rb") as f:
        audio_data = f.read()
    return audio_data


# Speech to text via Baidu AI Cloud
def speech_to_text_baidu(audio_path: str = debug_file_name, if_microphone: bool = True):
    if if_microphone:  # microphone input
        result = client.asr(_record(), "wav", 16000, {
            "dev_pid": 1537  # recognize Mandarin Chinese
        })
    else:  # read from a file
        result = client.asr(_get_file_content(audio_path), "wav", 16000, {
            "dev_pid": 1537  # recognize Mandarin Chinese
        })

    if result["err_msg"] != "success.":
        return "Speech recognition failed: " + result["err_msg"]
    else:
        return result["result"][0]


result = speech_to_text_baidu()
# DEBUG
print(result)
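The function above returns either the recognized text or an error message through the same string, so a caller cannot easily tell the two apart. One possible alternative, sketched here with a hypothetical helper `extract_text`, is to return a success flag alongside the value. It assumes the response dict also carries an `err_no` field that is 0 on success, in addition to the `err_msg` and `result` fields already used above:

```python
def extract_text(result):
    """Return (ok, text_or_error) from a Baidu ASR response dict."""
    # Assumption: err_no == 0 means success. err_msg and result are the
    # fields the recognition code in these notes already relies on.
    if result.get("err_no", 0) != 0 or not result.get("result"):
        return False, result.get("err_msg", "unknown error")
    return True, result["result"][0]
```

A caller could then write `ok, text = extract_text(client.asr(...))` and branch on `ok` instead of inspecting the text for an error prefix.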

Speech synthesis

from aip import AipSpeech
from playsound import playsound

APP_ID = "<APP_ID>"
API_KEY = "<API_KEY>"
SECRET_KEY = "<SECRET_KEY>"
client = AipSpeech(APP_ID, API_KEY, SECRET_KEY)
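# File the synthesized audio is saved to
# (synthesis returns MP3 data unless the "aue" option requests another format)
debug_file_name = "./debug.wav"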


# Convert text to an audio file
def baidu_tts(text=""):
    result = client.synthesis(text, "zh", 1, {
        "spd": 5,  # speaking speed
        "vol": 5,  # volume
        "per": 5,  # voice (speaker)
    })

    if not isinstance(result, dict):
        with open(debug_file_name, "wb") as f:
            f.write(result)
    else:
        print("Speech synthesis failed", result)


baidu_tts("文本内容")  # sample Chinese text to synthesize
# Play the audio file
playsound(debug_file_name)
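As written, the script plays the file even when synthesis failed, which either replays a stale file or raises because the file is missing. A small sketch of one way to guard against that; `save_audio` is a hypothetical helper, and it follows the same convention as the code above, where a dict is an error payload and anything else is raw audio bytes:

```python
def save_audio(result, path):
    """Write synthesized audio to path; return True only if it was written."""
    # A dict signals an error response; empty data is treated as a failure too.
    if isinstance(result, dict) or not result:
        return False
    with open(path, "wb") as f:
        f.write(result)
    return True
```

Playback can then be made conditional, for example `if save_audio(result, debug_file_name): playsound(debug_file_name)`, where `result` is the return value of `client.synthesis`.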