# Speech to Text

Convert spoken audio into written text with NeuraAI's speech recognition models. Powered by Whisper, our transcription service supports multiple languages and audio formats with high accuracy.

## Overview

The speech-to-text API can:

* Transcribe audio files in various formats (MP3, WAV, M4A, etc.)
* Support 50+ languages
* Handle background noise and accents
* Provide timestamps for segments
* Process files up to 25 MB

## Basic Transcription

Convert an audio file to text:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.neura-ai.app/v1"
)

with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcription.text)
```

## Supported Audio Formats

* MP3
* MP4
* MPEG
* MPGA
* M4A
* WAV
* WEBM

## Language Support

Whisper automatically detects the spoken language, but you can specify it for better accuracy:

```python
with open("spanish_audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="es"  # ISO-639-1 code
    )

print(transcription.text)
```

Common language codes:

* `en` - English
* `es` - Spanish
* `fr` - French
* `de` - German
* `it` - Italian
* `pt` - Portuguese
* `nl` - Dutch
* `ja` - Japanese
* `ko` - Korean
* `zh` - Chinese

## Response Formats

### Plain Text

With `response_format="text"`, the API returns the transcript as a plain string rather than an object:

```python
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="text"
)

print(transcription)  # already a string; there is no .text attribute
```

### JSON (Default)

```python
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="json"
)

print(transcription.text)
```

### Verbose JSON

Get detailed information including segments and timestamps:

```python
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="verbose_json"
)

print(f"Language: {transcription.language}")
print(f"Duration: {transcription.duration}")
for segment in transcription.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text}")
```

### SRT (Subtitles)

Generate subtitle files:

```python
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="srt"
)

with open("subtitles.srt", "w") as f:
    f.write(transcription)  # srt responses are plain strings
```

### VTT (WebVTT)

For web video players:

```python
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="vtt"
)

with open("subtitles.vtt", "w") as f:
    f.write(transcription)  # vtt responses are plain strings
```

## Advanced Options

### Prompt for Context

Provide context to improve accuracy:

```python
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    prompt="This is a technical discussion about machine learning and neural networks."
)
```

The prompt helps with:

* Technical terminology
* Proper nouns and names
* Domain-specific vocabulary
* Consistent spelling of terms

### Temperature

Control randomness in transcription (0-1):

```python
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    temperature=0.2  # lower = more consistent, higher = more varied
)
```

## Practical Examples

### Meeting Transcription

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.neura-ai.app/v1")

def transcribe_meeting(audio_path):
    with open(audio_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            prompt="Business meeting discussing Q4 targets and marketing strategy"
        )

    # Save full transcript
    with open("meeting_transcript.txt", "w") as f:
        f.write(transcription.text)

    # Save timestamped version
    with open("meeting_detailed.txt", "w") as f:
        for segment in transcription.segments:
            f.write(f"[{segment.start:.2f}s]: {segment.text}\n")

    return transcription

result = transcribe_meeting("quarterly_meeting.mp3")
print(f"Transcribed {result.duration:.2f} seconds of audio")
```

### Podcast Episode

```python
def transcribe_podcast(episode_file, episode_title):
    with open(episode_file, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="srt",
            prompt=f"Podcast episode: {episode_title}"
        )

    # Save as subtitle file (srt responses are plain strings)
    srt_filename = episode_file.replace(".mp3", ".srt")
    with open(srt_filename, "w", encoding="utf-8") as f:
        f.write(transcription)

    print(f"Subtitles saved to {srt_filename}")

transcribe_podcast("episode_42.mp3", "The Future of AI")
```

### Interview Transcription

```python
def transcribe_interview(audio_path, interviewer, interviewee):
    with open(audio_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            prompt=f"Interview between {interviewer} and {interviewee}"
        )

    # Format with timestamps
    formatted_text = f"Interview: {interviewee}\n"
    formatted_text += f"Duration: {transcription.duration:.0f} seconds\n\n"

    for segment in transcription.segments:
        timestamp = f"[{int(segment.start // 60):02d}:{int(segment.start % 60):02d}]"
        formatted_text += f"{timestamp} {segment.text}\n"

    return formatted_text

result = transcribe_interview(
    "interview.wav",
    "John Smith",
    "Jane Doe"
)
print(result)
```

### Video Subtitles Generator

```python
import os

def generate_subtitles(video_file):
    # Extract audio from video (requires ffmpeg)
    audio_file = video_file.replace(".mp4", ".mp3")
    os.system(f'ffmpeg -i "{video_file}" -q:a 0 -map a "{audio_file}" -y')

    # Transcribe
    with open(audio_file, "rb") as f:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="srt"
        )

    # Save subtitles (srt responses are plain strings)
    srt_file = video_file.replace(".mp4", ".srt")
    with open(srt_file, "w", encoding="utf-8") as f:
        f.write(transcription)

    # Clean up temporary audio
    os.remove(audio_file)

    print(f"✅ Subtitles generated: {srt_file}")

generate_subtitles("presentation.mp4")
```

## Handling Large Files

For files larger than 25 MB, split them into chunks:

```python
import os

from pydub import AudioSegment

def transcribe_large_file(file_path):
    # Load audio
    audio = AudioSegment.from_file(file_path)

    # Split into 10-minute chunks
    chunk_length_ms = 10 * 60 * 1000
    chunks = [audio[i:i + chunk_length_ms]
              for i in range(0, len(audio), chunk_length_ms)]

    full_transcript = ""

    for i, chunk in enumerate(chunks):
        # Export chunk
        chunk_file = f"temp_chunk_{i}.mp3"
        chunk.export(chunk_file, format="mp3")

        # Transcribe
        with open(chunk_file, "rb") as f:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=f
            )
        full_transcript += transcription.text + " "

        # Clean up
        os.remove(chunk_file)

    return full_transcript.strip()
```

## Best Practices

### Audio Quality

* Use lossless formats (WAV) when possible
* Minimum bitrate: 64 kbps
* Recommended sample rate: 16 kHz or higher
* Reduce background noise before transcription

### Language Detection

* Specify the language code for better accuracy
* Use prompts for technical or specialized content
* For multilingual audio, transcribe segments separately

### Processing Tips

* Split long files into manageable chunks
* Use lower temperature (0.0-0.3) for technical content
* Use higher temperature (0.5-0.8) for creative content
* Provide context prompts for better terminology recognition

### Error Handling

```python
def safe_transcribe(audio_path):
    try:
        with open(audio_path, "rb") as audio_file:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
                response_format="text"
            )
        return transcription  # plain string for text format
    except FileNotFoundError:
        print(f"❌ File not found: {audio_path}")
        return None
    except Exception as e:
        print(f"❌ Transcription error: {e}")
        return None
```

## Common Use Cases

* **Meeting Notes** - Automatic transcription of business meetings
* **Podcast Production** - Generate show notes and transcripts
* **Video Subtitles** - Create accessibility captions
* **Interview Analysis** - Transcribe research interviews
* **Voice Notes** - Convert voice memos to text
* **Customer Support** - Transcribe support calls for analysis
* **Legal Documentation** - Transcribe depositions and hearings
* **Medical Records** - Convert doctor dictations to text

## Limitations

* Maximum file size: 25 MB
* Audio length: up to several hours, subject to the file-size limit
* Background noise may affect accuracy
* Heavy accents may require language specification
* Real-time streaming not supported (batch processing only)

## Tips for Better Results

1. **Clean Audio** - Remove background noise when possible
2. **Good Microphone** - Use quality recording equipment
3. **Clear Speech** - Speak clearly and at a moderate pace
4. **Context Prompts** - Provide relevant context for technical terms
5. **Specify Language** - Set the language code for non-English audio
6. **Format Choice** - Use verbose JSON for editing, SRT for subtitles