Speech to Text

Convert spoken audio into written text with NeuraAI’s speech recognition models. Powered by Whisper, our transcription service supports multiple languages and audio formats with high accuracy.

Overview

The speech-to-text API can:

  • Transcribe audio files in various formats (MP3, WAV, M4A, etc.)
  • Support 50+ languages
  • Handle background noise and accents
  • Provide timestamps for segments
  • Process files up to 25MB

Basic Transcription

Convert an audio file to text:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.neura-ai.app/v1"
)

with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcription.text)

Supported Audio Formats

  • MP3
  • MP4
  • MPEG
  • MPGA
  • M4A
  • WAV
  • WEBM
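A small helper can reject unsupported files before you spend time uploading them. This is a sketch of ours (the helper name is not part of the API); it simply checks the extension against the list above:

```python
# Formats the transcription endpoint accepts, keyed by file extension
SUPPORTED_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}

def is_supported(filename: str) -> bool:
    """Return True if the file extension is one the API accepts."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return ext in SUPPORTED_FORMATS
```

For example, `is_supported("talk.MP3")` passes while `is_supported("notes.flac")` does not.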

Language Support

Whisper automatically detects the spoken language, but you can specify it for better accuracy:

with open("spanish_audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="es"  # ISO-639-1 code
    )

print(transcription.text)

Common language codes:

  • en - English
  • es - Spanish
  • fr - French
  • de - German
  • it - Italian
  • pt - Portuguese
  • nl - Dutch
  • ja - Japanese
  • ko - Korean
  • zh - Chinese
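If your application accepts language names from users, a lookup table built from the codes above avoids typos in the `language` parameter. A minimal sketch (the helper is illustrative, not part of the API):

```python
# ISO-639-1 codes for the languages listed above
LANGUAGE_CODES = {
    "english": "en", "spanish": "es", "french": "fr", "german": "de",
    "italian": "it", "portuguese": "pt", "dutch": "nl",
    "japanese": "ja", "korean": "ko", "chinese": "zh",
}

def language_code(name: str):
    """Map a language name to its ISO-639-1 code, or None if unknown."""
    return LANGUAGE_CODES.get(name.strip().lower())
```

Returning `None` for unknown names lets you fall back to Whisper's automatic detection rather than sending a bad code.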

Response Formats

Plain Text (Default)

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="text"
)

# With response_format="text" the API returns the transcript as a plain string
print(transcription)

JSON

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="json"
)

print(transcription.text)

Verbose JSON

Get detailed information including segments and timestamps:

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="verbose_json"
)

print(f"Language: {transcription.language}")
print(f"Duration: {transcription.duration}")

for segment in transcription.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text}")

SRT (Subtitles)

Generate subtitle files:

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="srt"
)

# response_format="srt" returns the subtitle content as a plain string
with open("subtitles.srt", "w") as f:
    f.write(transcription)

VTT (WebVTT)

For web video players:

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="vtt"
)

# response_format="vtt" returns the subtitle content as a plain string
with open("subtitles.vtt", "w") as f:
    f.write(transcription)

Advanced Options

Prompt for Context

Provide context to improve accuracy:

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    prompt="This is a technical discussion about machine learning and neural networks."
)

The prompt helps with:

  • Technical terminology
  • Proper nouns and names
  • Domain-specific vocabulary
  • Consistent spelling of terms

Temperature

Control randomness in transcription (0-1):

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    temperature=0.2  # Lower = more consistent, Higher = more varied
)
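A pattern worth sketching: start at temperature 0 and retry at slightly higher values only when a pass fails. The helper below is our own assumption, not an API feature; it takes any callable, so the retry logic can be exercised without a network call:

```python
def transcribe_with_fallback(transcribe, temperatures=(0.0, 0.2, 0.4)):
    """Call transcribe(temperature) at rising temperatures until one succeeds."""
    last_error = None
    for temp in temperatures:
        try:
            return transcribe(temp)
        except Exception as exc:
            last_error = exc  # remember the failure, try a higher temperature
    raise last_error
```

In practice you would wrap the API call, e.g. `transcribe_with_fallback(lambda t: client.audio.transcriptions.create(model="whisper-1", file=audio_file, temperature=t))`.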

Practical Examples

Meeting Transcription

from openai import OpenAI

client = OpenAI(base_url="https://api.neura-ai.app/v1")

def transcribe_meeting(audio_path):
    with open(audio_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            prompt="Business meeting discussing Q4 targets and marketing strategy"
        )

    # Save full transcript
    with open("meeting_transcript.txt", "w") as f:
        f.write(transcription.text)

    # Save timestamped version
    with open("meeting_detailed.txt", "w") as f:
        for segment in transcription.segments:
            f.write(f"[{segment.start:.2f}s]: {segment.text}\n")

    return transcription

result = transcribe_meeting("quarterly_meeting.mp3")
print(f"Transcribed {result.duration:.2f} seconds of audio")

Podcast Episode

def transcribe_podcast(episode_file, episode_title):
    with open(episode_file, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="srt",
            prompt=f"Podcast episode: {episode_title}"
        )

    # Save as subtitle file (response_format="srt" returns a plain string)
    srt_filename = episode_file.replace(".mp3", ".srt")
    with open(srt_filename, "w", encoding="utf-8") as f:
        f.write(transcription)

    print(f"Subtitles saved to {srt_filename}")

transcribe_podcast("episode_42.mp3", "The Future of AI")

Interview Transcription

def transcribe_interview(audio_path, interviewer, interviewee):
    with open(audio_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            prompt=f"Interview between {interviewer} and {interviewee}"
        )

    # Format with timestamps
    formatted_text = f"Interview: {interviewee}\n"
    formatted_text += f"Duration: {transcription.duration:.0f} seconds\n\n"

    for segment in transcription.segments:
        timestamp = f"[{int(segment.start // 60):02d}:{int(segment.start % 60):02d}]"
        formatted_text += f"{timestamp} {segment.text}\n"

    return formatted_text

result = transcribe_interview(
    "interview.wav",
    "John Smith",
    "Jane Doe"
)
print(result)

Video Subtitles Generator

import os

def generate_subtitles(video_file):
    # Extract audio from video (requires ffmpeg)
    audio_file = video_file.replace(".mp4", ".mp3")
    os.system(f'ffmpeg -i "{video_file}" -q:a 0 -map a "{audio_file}" -y')

    # Transcribe
    with open(audio_file, "rb") as f:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="srt"
        )

    # Save subtitles (response_format="srt" returns a plain string)
    srt_file = video_file.replace(".mp4", ".srt")
    with open(srt_file, "w", encoding="utf-8") as f:
        f.write(transcription)

    # Clean up temporary audio
    os.remove(audio_file)

    print(f"✅ Subtitles generated: {srt_file}")

generate_subtitles("presentation.mp4")

Handling Large Files

For files larger than 25MB, split them into chunks:

import os

from pydub import AudioSegment

def transcribe_large_file(file_path):
    # Load audio
    audio = AudioSegment.from_file(file_path)

    # Split into 10-minute chunks
    chunk_length_ms = 10 * 60 * 1000
    chunks = [audio[i:i + chunk_length_ms]
              for i in range(0, len(audio), chunk_length_ms)]

    full_transcript = ""

    for i, chunk in enumerate(chunks):
        # Export chunk
        chunk_file = f"temp_chunk_{i}.mp3"
        chunk.export(chunk_file, format="mp3")

        # Transcribe
        with open(chunk_file, "rb") as f:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=f
            )

        full_transcript += transcription.text + " "

        # Clean up
        os.remove(chunk_file)

    return full_transcript.strip()

Best Practices

Audio Quality

  • Use lossless formats (WAV) when possible
  • Minimum bitrate: 64 kbps
  • Recommended sample rate: 16kHz or higher
  • Reduce background noise before transcription
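The recommendations above can be applied in one preprocessing pass with ffmpeg. This sketch only builds the command line (the filter settings are illustrative defaults, not values mandated by the API); run it with `subprocess.run`:

```python
def preprocess_command(src: str, dst: str) -> list:
    """Build an ffmpeg command that downmixes to mono, resamples to
    16 kHz, and cuts low-frequency rumble before transcription."""
    return [
        "ffmpeg", "-i", src,
        "-ac", "1",               # single (mono) channel
        "-ar", "16000",           # 16 kHz sample rate
        "-af", "highpass=f=80",   # trim low-frequency background noise
        "-y", dst,                # overwrite the output if it exists
    ]
```

For example, `subprocess.run(preprocess_command("raw.m4a", "clean.wav"), check=True)` produces a cleaned WAV file ready for upload.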

Language Detection

  • Specify language code for better accuracy
  • Use prompts for technical or specialized content
  • For multilingual audio, transcribe segments separately

Processing Tips

  • Split long files into manageable chunks
  • Use lower temperature (0.0-0.3) for technical content
  • Use higher temperature (0.5-0.8) for creative content
  • Provide context prompts for better terminology recognition

Error Handling

def safe_transcribe(audio_path):
    try:
        with open(audio_path, "rb") as audio_file:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
                response_format="text"
            )
        # response_format="text" returns the transcript as a plain string
        return transcription

    except FileNotFoundError:
        print(f"❌ File not found: {audio_path}")
        return None

    except Exception as e:
        print(f"❌ Transcription error: {e}")
        return None

Common Use Cases

  • Meeting Notes - Automatic transcription of business meetings
  • Podcast Production - Generate show notes and transcripts
  • Video Subtitles - Create accessibility captions
  • Interview Analysis - Transcribe research interviews
  • Voice Notes - Convert voice memos to text
  • Customer Support - Transcribe support calls for analysis
  • Legal Documentation - Transcribe depositions and hearings
  • Medical Records - Convert doctor dictations to text

Limitations

  • Maximum file size: 25MB
  • Supported audio length: Up to several hours
  • Background noise may affect accuracy
  • Heavy accents may require language specification
  • Real-time streaming not supported (batch processing only)
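A pre-flight size check makes the 25MB limit explicit in code. This is a small sketch of ours (the function name is illustrative) that tells you whether a file must be split before uploading:

```python
import os

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # the 25MB file size limit

def needs_chunking(path: str) -> bool:
    """Return True when a file exceeds the upload limit and must be split."""
    return os.path.getsize(path) > MAX_UPLOAD_BYTES
```

Files that fail this check can be routed through the chunking approach shown earlier under Handling Large Files.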

Tips for Better Results

  1. Clean Audio - Remove background noise when possible
  2. Good Microphone - Use quality recording equipment
  3. Clear Speech - Speak clearly and at moderate pace
  4. Context Prompts - Provide relevant context for technical terms
  5. Specify Language - Set language code for non-English audio
  6. Format Choice - Use verbose JSON for editing, SRT for subtitles