This is Part 2 of our AI Podcasts series. Part 1 covers the user experience and why we believe audio-first learning changes everything. This post is the engineering deep-dive.
NerdSip has thousands of courses. Each one contains three to eight lessons. Generating audio for every single lesson upfront would be ruinously expensive and spectacularly wasteful. The vast majority of courses will never be listened to. So we asked a different question: what if we generate audio only at the exact moment someone wants to hear it?
That question shaped the entire architecture.
The Voice Stack at a Glance
Before we get into the details, here is the full pipeline in one sentence: a user taps play, the server preprocesses the lesson text, sends it to Google Cloud TTS, receives an MP3, stores it with a content-based hash key, and streams it to the mobile player. Every subsequent request for the same content skips generation entirely and serves the cached file.
Simple in concept. The interesting parts are in the margins.
Google Cloud Text-to-Speech and Chirp3-HD
We evaluated several TTS providers before settling on Google Cloud. The deciding factor was Chirp3-HD, Google's latest multilingual voice model. It sounds remarkably natural, handles technical vocabulary well, and supports the languages we need.
The configuration is straightforward:
- English: en-US-Chirp3-HD-Enceladus
- German: a dedicated de-DE Chirp3-HD variant
- Encoding: MP3 at a 24kHz sample rate
- Auth: OAuth 2.0 with service account credentials
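Mapped onto the public `v1` `text:synthesize` endpoint, a per-lesson request is compact. The sketch below assumes Node's built-in `fetch` and an OAuth access token obtained elsewhere from the service account; the German voice name is a placeholder, since only the English voice is named above:

```typescript
// Voice configuration per language. The German voice name is a
// placeholder; the production config uses a dedicated de-DE Chirp3-HD voice.
const VOICES: Record<string, { languageCode: string; name: string }> = {
  en: { languageCode: "en-US", name: "en-US-Chirp3-HD-Enceladus" },
  de: { languageCode: "de-DE", name: "de-DE-Chirp3-HD-Enceladus" }, // placeholder name
};

// Synthesize a single lesson and return the MP3 bytes.
async function synthesizeLesson(
  text: string,
  language: "en" | "de",
  accessToken: string, // OAuth 2.0 token from service account credentials
): Promise<Buffer> {
  const res = await fetch("https://texttospeech.googleapis.com/v1/text:synthesize", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${accessToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      input: { text },
      voice: VOICES[language],
      audioConfig: { audioEncoding: "MP3", sampleRateHertz: 24000 },
    }),
  });
  if (!res.ok) throw new Error(`TTS request failed: ${res.status}`);
  const { audioContent } = (await res.json()) as { audioContent: string };
  return Buffer.from(audioContent, "base64"); // API returns base64-encoded MP3
}
```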
Each lesson is synthesized individually rather than the entire course at once. This keeps request payloads small, generation times fast, and failures isolated. If lesson four fails to generate, lessons one through three are still perfectly playable.
Text Preprocessing: From Markdown to Speech
Raw lesson content is not ready for a TTS engine. It contains markdown formatting, emoji, hyperlinks, and structural elements that would sound absurd read aloud. Imagine hearing "asterisk asterisk bold text asterisk asterisk" in the middle of a lesson about quantum physics.
So every lesson passes through a preprocessing pipeline before it reaches Google's API:
- Strip markdown. Bold, italic, headings, bullet markers, all removed. The text flattens to clean prose.
- Remove emoji. A Unicode-range regex catches every emoji variant. They add nothing to spoken audio.
- Clean URLs. Hyperlinks are either removed entirely or replaced with their display text.
- Add speech structure. The system wraps each lesson in a spoken template: a course title introduction, lesson numbering ("Lesson 2 of 5"), the actual content, a key takeaway, and a transition phrase leading into the next lesson.
These templates are language-aware. The English version says "Here is the key takeaway" while the German version uses the appropriate native phrasing. Small details, but they make the listening experience feel intentional rather than robotic.
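A minimal sketch of the first few steps, with a language-aware wrapper bolted on. The regexes and template phrasing here are illustrative stand-ins, not the production pipeline:

```typescript
// Strip markdown formatting down to clean prose. Order matters: bold
// (**/__) must go before italic (*/_) so delimiters aren't half-consumed.
function stripMarkdown(text: string): string {
  return text
    .replace(/^#{1,6}\s+/gm, "")              // headings
    .replace(/(\*\*|__)(.*?)\1/g, "$2")       // bold
    .replace(/(\*|_)(.*?)\1/g, "$2")          // italic
    .replace(/^\s*[-*+]\s+/gm, "")            // bullet markers
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1"); // links -> display text
}

// Remove emoji via a Unicode-property escape.
function stripEmoji(text: string): string {
  return text.replace(/\p{Extended_Pictographic}/gu, "");
}

// Language-aware takeaway phrasing; wording here is illustrative.
const TAKEAWAY_INTRO: Record<"en" | "de", string> = {
  en: "Here is the key takeaway:",
  de: "Das Wichtigste zum Mitnehmen:",
};

function prepareForSpeech(
  lessonText: string,
  takeaway: string,
  lang: "en" | "de",
): string {
  const clean = stripEmoji(stripMarkdown(lessonText)).trim();
  return `${clean} ${TAKEAWAY_INTRO[lang]} ${takeaway}`;
}
```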
The "Play First, Generate Rest" Strategy
This is the architectural decision we are proudest of.
When a user taps play, the system requests only the first lesson's audio. That single lesson typically generates in three to eight seconds. Playback begins immediately. While lesson one plays (roughly two minutes of audio), lesson two is already generating in the background.
Every track-change event triggers pre-generation of the N+1 lesson. The user finishes lesson two; lesson three is already waiting. The result is seamless, gapless playback across an entire course.
What happens if the next lesson is not ready yet? The player displays a brief "Preparing next lesson..." indicator and auto-resumes the moment the audio arrives. In practice, this almost never triggers. The generation window is nearly always shorter than the playback duration of the current lesson. The math works in our favor.
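The trigger itself is a few lines. In this sketch, `ensureAudio` stands in for an assumed idempotent server call that returns immediately on a cache hit and kicks off generation on a miss:

```typescript
type Lesson = { id: string; textHash: string };

// On every track change, kick off generation for the N+1 lesson.
function prefetchNextLesson(
  lessons: Lesson[],
  currentIndex: number,
  ensureAudio: (lesson: Lesson) => Promise<void>,
): void {
  const next = lessons[currentIndex + 1];
  if (!next) return; // last lesson: nothing left to prefetch
  // Fire-and-forget: generation overlaps with current-lesson playback.
  void ensureAudio(next).catch(() => {
    // On failure the player falls back to the "Preparing next lesson..."
    // state when the user actually reaches this track.
  });
}
```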
This approach has a massive cost advantage too. If a user listens to only the first lesson and moves on, we have generated exactly one lesson's worth of audio. No waste.
Smart Caching with Content Hashing
Generating the same audio twice is pure waste. Our caching layer prevents it.
Every lesson receives an MD5 hash of its text content, truncated to the first 12 characters. This hash serves as the cache key. When a user requests audio, the system checks: does an MP3 already exist for this hash?
- Cache hit: return the stored MP3 instantly. No TTS call. No cost.
- Cache miss: generate the audio, store it, and associate it with the hash.
The elegance is in what happens when course content changes. An instructor edits a lesson, the content hash changes, and the old cached audio becomes orphaned. The next play request generates fresh audio that reflects the updated content. No manual cache invalidation required. The hash handles it automatically.
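The key derivation and lookup path reduce to a handful of lines. Here `store` is a stand-in for Convex file storage, and `generate` for the TTS call:

```typescript
import { createHash } from "node:crypto";

// Cache key: MD5 of the lesson text, truncated to 12 hex characters.
// Any edit to the text yields a new key, so stale audio is never served.
function audioCacheKey(lessonText: string): string {
  return createHash("md5").update(lessonText, "utf8").digest("hex").slice(0, 12);
}

// Lookup path sketch; `store` and `generate` are injected stand-ins.
async function getOrGenerateAudio(
  lessonText: string,
  store: {
    get(key: string): Promise<Buffer | null>;
    put(key: string, mp3: Buffer): Promise<void>;
  },
  generate: (text: string) => Promise<Buffer>,
): Promise<Buffer> {
  const key = audioCacheKey(lessonText);
  const cached = await store.get(key);
  if (cached) return cached;   // cache hit: no TTS call, no cost
  const mp3 = await generate(lessonText);
  await store.put(key, mp3);   // cache miss: generate once, store forever
  return mp3;
}
```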
Audio files are stored in Convex file storage, which is backed by Cloudflare R2. There is no expiration policy. Cached audio lives indefinitely until the underlying content changes. Storage is cheap. TTS API calls are not.
Deduplication and Rate Limiting
Concurrency creates problems. Two users tap play on the same course at the same instant. Without safeguards, both requests would trigger duplicate TTS calls for the same lesson. Multiply that across thousands of users and the API bill becomes unpredictable.
We solve this with generation locks. When a generation begins, the system writes a lock record for that specific lesson hash. Any concurrent request for the same hash sees the lock and waits for the first generation to complete rather than spawning a duplicate.
Locks auto-expire after 60 seconds. This prevents permanent deadlocks if a generation process crashes mid-flight. Sixty seconds is generous; most generations finish in under ten.
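The lock semantics fit in a short sketch. Production stores lock records in the database so they hold across server instances; an in-memory Map is enough to show the logic:

```typescript
const LOCK_TTL_MS = 60_000; // locks auto-expire after 60 seconds
const locks = new Map<string, number>(); // lesson hash -> acquisition time

// Returns true if the caller should generate; false if another
// generation for the same hash is already in flight.
function tryAcquireLock(hash: string, now: number = Date.now()): boolean {
  const acquiredAt = locks.get(hash);
  if (acquiredAt !== undefined && now - acquiredAt < LOCK_TTL_MS) {
    return false; // held: wait for the first generation, don't duplicate
  }
  locks.set(hash, now); // fresh lock, or a stale one reclaimed after expiry
  return true;
}

function releaseLock(hash: string): void {
  locks.delete(hash);
}
```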
On top of deduplication, we enforce per-user rate limiting: 10 generation requests per user per minute. This prevents abuse, keeps costs bounded, and ensures fair access across the user base. The limit is high enough that no legitimate user will ever hit it during normal listening.
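A rolling-window check captures the idea. As with the locks, production would persist this state; the Map here just demonstrates the logic:

```typescript
const WINDOW_MS = 60_000;  // rolling one-minute window
const MAX_REQUESTS = 10;   // per-user generation budget per window
const requestLog = new Map<string, number[]>(); // userId -> timestamps

// Returns true if this request is within the user's budget.
function allowGeneration(userId: string, now: number = Date.now()): boolean {
  const recent = (requestLog.get(userId) ?? []).filter((t) => now - t < WINDOW_MS);
  if (recent.length >= MAX_REQUESTS) {
    requestLog.set(userId, recent);
    return false; // over the limit: reject before any TTS cost is incurred
  }
  recent.push(now);
  requestLog.set(userId, recent);
  return true;
}
```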
Background Hygiene: The 3 AM Cron
On-demand generation handles the long tail beautifully. But popular courses deserve better than making the first listener wait.
A daily cron job runs at 3:00 AM UTC. It scans for courses marked as "ready" that have zero cached audio. For each run, it pre-generates audio for up to 10 courses, prioritized by popularity signals.
This means trending courses and frequently browsed topics will have instant audio available before anyone taps play. The cron acts as a warming layer on top of the on-demand system. Both strategies complement each other: the cron handles the predictable hits, on-demand handles everything else.
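The selection step reduces to a filter-sort-slice. The schema fields below (`status`, `cachedLessons`, `popularity`) are assumed stand-ins for the production course records:

```typescript
type Course = {
  id: string;
  status: string;        // e.g. "ready"
  cachedLessons: number; // how many lessons already have cached audio
  popularity: number;    // assumed popularity signal, higher = hotter
};

// Pick up to `limit` ready courses with no cached audio, hottest first.
function selectCoursesToWarm(courses: Course[], limit = 10): Course[] {
  return courses
    .filter((c) => c.status === "ready" && c.cachedLessons === 0)
    .sort((a, b) => b.popularity - a.popularity)
    .slice(0, limit);
}
```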
Mobile Playback Architecture
Generating great audio means nothing if the playback experience is poor. On the mobile side, we use react-native-track-player, a battle-tested library for native audio playback on iOS and Android.
The feature set:
- Background playback. Audio continues when the app is minimized. Lock screen controls and notification controls work natively on both platforms.
- Variable speed. Four options: 0.75x, 1.0x, 1.25x, and 1.5x. The selected speed persists in AsyncStorage, so users do not need to re-select it every session.
- Mini player. A compact player bar stays visible at the bottom of the screen during app navigation. Tap it to expand into the full-screen player with seek bar, course artwork, and lesson metadata.
- Progressive loading. The player is tightly integrated with the generation pipeline. Track metadata updates in real-time as new lessons become available.
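The speed-persistence piece is a small amount of glue. In this sketch the AsyncStorage and TrackPlayer calls are injected so the logic stays testable; the storage key name is an assumption:

```typescript
const SPEEDS = [0.75, 1.0, 1.25, 1.5];
const DEFAULT_SPEED = 1.0;

// Decide which rate to apply on startup; missing or invalid stored
// values fall back to 1.0x.
function resolveSpeed(stored: string | null): number {
  const parsed = stored === null ? NaN : Number(stored);
  return SPEEDS.includes(parsed) ? parsed : DEFAULT_SPEED;
}

// Restore the persisted speed when the player mounts.
async function restoreSpeed(
  getItem: (key: string) => Promise<string | null>, // e.g. AsyncStorage.getItem
  setRate: (rate: number) => Promise<void>,         // e.g. TrackPlayer.setRate
): Promise<number> {
  const speed = resolveSpeed(await getItem("playback_speed")); // key name assumed
  await setRate(speed);
  return speed;
}
```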
One subtle but important detail: the player tracks which lesson the user last listened to. If you close the app mid-course and return hours later, playback resumes from exactly where you left off.
What We Learned
Building this system reinforced a few principles that apply well beyond audio generation.
Lazy beats eager. Pre-generating audio for every course would have been the obvious approach. It would also have been the wrong one. On-demand generation with caching gives us the best of both worlds: instant playback for repeat content, minimal cost for the long tail.
Hash-based caching is underrated. Content hashing as a cache key eliminates an entire class of cache invalidation bugs. The cache is always correct because it is derived from the content itself.
Progressive loading changes perception. Users do not care about total generation time. They care about time-to-first-audio. By starting playback immediately and generating ahead, we turned a 30-second total generation time into a 5-second perceived wait.
The system now handles thousands of audio generations daily. It scales linearly, costs are proportional to actual usage, and the architecture has proven remarkably simple to maintain. Sometimes the best systems are the ones that do the least work possible, at exactly the right moment.
Frequently Asked Questions
What TTS engine does NerdSip use?
NerdSip uses Google Cloud Text-to-Speech with the Chirp3-HD voice model. English courses use the en-US-Chirp3-HD-Enceladus voice, while German courses use a dedicated de-DE equivalent. Audio is encoded as MP3 at a 24kHz sample rate.
How fast is audio generation?
A single lesson typically generates in 3 to 8 seconds depending on length. Because NerdSip uses a progressive loading strategy, the first lesson begins playing almost immediately while remaining lessons generate in the background. Most users never experience a loading pause.
Does NerdSip Voice Mode work offline?
Audio generation requires an internet connection because it relies on Google Cloud TTS. However, once a lesson has been generated and cached, subsequent plays load the cached MP3 directly from storage, which is significantly faster.
How does NerdSip keep audio in sync with course updates?
Every lesson is assigned an MD5 content hash. When course content changes, the hash changes too, which invalidates the old cache and triggers fresh audio generation on the next play. This ensures listeners always hear the latest version.
Ready to Listen and Learn?
Turn any course into an AI podcast. Download NerdSip and start learning by listening, wherever you are.