My Simple Speech To Text

August 24, 2025

The Zero-UI Transcription Stack: Native-feel on iOS/Mac, serverless-fast, nearly free

This project turns S3 into your “native” transcription workspace. Record on your iPhone, tap Share → “Upload to S3,” and a few moments later the text appears in the same Files view on iOS and macOS, without opening a dedicated app. Under the hood, an S3 event triggers a Chalice-powered Lambda function, Whisper transcribes the audio, Gemini tidies the transcript into clean Markdown, the result is saved back to S3, and your phone gets a push notification.

Why this is interesting

  • Native UX, no app UI: The iOS Files app serves as the UI. Drag audio into audio/, read Markdown from transcriptions/. Feels local; works everywhere.
  • Small, readable serverless code: Chalice makes Lambda ergonomic, so the whole pipeline stays compact.
  • High quality, low cost: Whisper-large-v3 for accuracy and Gemini 2.5 Pro for formatting, at roughly $0.01 per minute of audio end to end.
  • Long-form resilient: Built for long voice notes that turn into essays.

How it works (end-to-end)

  1. Upload audio to S3 audio/.
  2. S3 event invokes Lambda (Chalice).
  3. Whisper (Groq) transcribes to text.
  4. Gemini formats to clean, paragraph-structured Markdown.
  5. Result saved to S3 transcriptions/filename.ext.txt.
  6. Phone notified via ntfy.sh; open in Files immediately.
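Steps 5 and 6 can be sketched in plain Python. This is a minimal illustration, not the repo's code: the helper names output_key_for, build_ntfy_request and notify are hypothetical, and the topic name is whatever you choose. It relies only on ntfy.sh's basic publish behaviour (an HTTP POST to https://ntfy.sh/<topic>, with the request body as the message and an optional Title header). Topics on the public ntfy.sh server can be read by anyone who knows the name, so an unguessable topic is a sensible default.

```python
import urllib.request

AUDIO_PREFIX = "audio/"
TRANSCRIPT_PREFIX = "transcriptions/"


def output_key_for(input_key: str) -> str:
    """Map audio/<name> to transcriptions/<name>.txt, keeping the original extension."""
    if not input_key.startswith(AUDIO_PREFIX):
        raise ValueError(f"expected a key under {AUDIO_PREFIX!r}, got {input_key!r}")
    # Strip only the leading prefix, so nested folders keep their names.
    return TRANSCRIPT_PREFIX + input_key[len(AUDIO_PREFIX):] + ".txt"


def build_ntfy_request(topic: str, output_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST to ntfy.sh; the body becomes the notification text."""
    return urllib.request.Request(
        url=f"https://ntfy.sh/{topic}",
        data=f"Transcription ready: {output_key}".encode("utf-8"),
        headers={"Title": "Transcription ready"},
        method="POST",
    )


def notify(topic: str, output_key: str) -> None:
    """Send the push notification to every device subscribed to the topic."""
    with urllib.request.urlopen(build_ntfy_request(topic, output_key), timeout=10) as resp:
        resp.read()


if __name__ == "__main__":
    # Prints the key the transcript would be written to; no network call is made here.
    print(output_key_for("audio/2025-08-24 walk.m4a"))
```

Because output_key_for strips only the leading audio/ prefix, a key such as audio/sub/audio/x.wav maps to transcriptions/sub/audio/x.wav.txt instead of losing the inner folder name.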

The core flow in three steps (from the repo)

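import os

import boto3
from chalice import Chalice
from groq import Groq

# Setup shown for illustration; the repo's exact setup may differ.
app = Chalice(app_name="transcriber")  # app name is a placeholder
s3 = boto3.client("s3")
groq_client = Groq()  # reads GROQ_API_KEY from the environment


@app.on_s3_event("audio-to-transcribe1", events=["s3:ObjectCreated:*"], prefix="audio/")
def transcribe_audio(event):
    """Transcribe a new upload under audio/, format it, and save it under transcriptions/."""
    # Step 1: fetch the uploaded audio and transcribe it with Whisper on Groq.
    filename = os.path.basename(event.key)
    audio_bytes = s3.get_object(Bucket=event.bucket, Key=event.key)["Body"].read()
    transcription = groq_client.audio.transcriptions.create(
        file=(filename, audio_bytes),
        model="whisper-large-v3",
        response_format="verbose_json",
        language="en",
        temperature=0.0,
        prompt="This is an audio recording. Please transcribe accurately with proper punctuation.",
    )

    # Step 2: turn the raw transcript into clean Markdown with Gemini.
    # generate_formatted_transcription is a helper defined elsewhere in the repo.
    formatted_text = generate_formatted_transcription(transcription.text)

    # Step 3: save the formatted transcription back to S3,
    # e.g. audio/note.m4a -> transcriptions/note.m4a.txt.
    input_key_without_prefix = event.key.removeprefix("audio/")
    output_key = f"transcriptions/{input_key_without_prefix}.txt"
    s3.put_object(
        Bucket=event.bucket,
        Key=output_key,
        Body=formatted_text.encode("utf-8"),
        ContentType="text/plain; charset=utf-8",
    )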
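The handler calls generate_formatted_transcription, which the excerpt does not show. One possible shape for it is sketched below, assuming the google-genai SDK (pip install google-genai) and the gemini-2.5-pro model id. The prompt wording is a hypothetical example, not the author's, and the client parameter exists only so the function can be exercised without network access.

```python
# Hypothetical formatting instructions; tune these to taste.
FORMAT_INSTRUCTIONS = (
    "Rewrite the following raw speech-to-text transcript as clean Markdown. "
    "Fix punctuation and grammar, split it into paragraphs, and add headings "
    "where the topic changes. Do not add or remove content.\n\n"
)


def build_formatting_prompt(raw_text: str) -> str:
    """Combine the formatting instructions with the raw transcript."""
    return FORMAT_INSTRUCTIONS + raw_text.strip()


def generate_formatted_transcription(raw_text: str, client=None, model: str = "gemini-2.5-pro") -> str:
    """Ask Gemini to tidy the transcript; pass a client explicitly to inject a fake in tests."""
    if client is None:
        # Imported lazily so the module loads without the SDK installed;
        # genai.Client() reads its API key from the environment.
        from google import genai

        client = genai.Client()
    response = client.models.generate_content(model=model, contents=build_formatting_prompt(raw_text))
    # Fall back to the raw transcript if Gemini returns no text, so nothing is lost.
    return response.text or raw_text
```

Falling back to the unformatted text is a design choice: a formatting hiccup still leaves a usable transcript in transcriptions/ rather than nothing.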

What makes it different

  • Files-as-API: Treat S3 like a shared folder synced to iOS/macOS. No custom app, no queue UI, no dashboards.
  • Opinionated formatting: Gemini upgrades raw ASR into publishable Markdown—paragraph breaks, grammar, headings—ready to paste.
  • Developer ergonomics: Chalice for Lambda, uv for fast Python environments. Minimal code, maximum leverage.
  • Observability-first: Clean logs, simple notifications (ntfy.sh), and a clear S3 structure for auditing.
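Because every input lives under audio/ and every output under transcriptions/, a short script can audit which uploads never produced a transcript. This is a hypothetical helper, assuming the audio/<name> → transcriptions/<name>.txt naming used by the handler; it uses boto3's list_objects_v2 paginator, and the S3 client can be injected for testing.

```python
def list_keys(bucket: str, prefix: str, s3=None) -> list[str]:
    """Return every object key under a prefix, following S3 pagination."""
    if s3 is None:
        import boto3  # imported lazily so the pure helpers work without boto3

        s3 = boto3.client("s3")
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for empty prefixes have no "Contents" entry.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def find_untranscribed(audio_keys, transcript_keys,
                       audio_prefix: str = "audio/",
                       transcript_prefix: str = "transcriptions/") -> list[str]:
    """Return audio keys that have no matching <name>.txt under the transcript prefix."""
    done = {
        key[len(transcript_prefix):-len(".txt")]
        for key in transcript_keys
        if key.startswith(transcript_prefix) and key.endswith(".txt")
    }
    return sorted(
        key
        for key in audio_keys
        # Skip "folder" placeholder objects, which end with a slash.
        if key.startswith(audio_prefix) and not key.endswith("/")
        and key[len(audio_prefix):] not in done
    )
```

Typical use would be find_untranscribed(list_keys(bucket, "audio/"), list_keys(bucket, "transcriptions/")), run by hand or on a schedule.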

Extending it

  • Multi-language, diarization, or redaction.
  • Per-prefix routing (e.g., different models per folder).
  • Sharing results back via presigned S3 links, or webhooks.
  • Cost guardrails (duration caps) and retry/backoff policies.
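Two of these extensions, per-prefix routing and retry/backoff, fit in a few lines. The sketch below is illustrative only: the folder names in MODEL_BY_PREFIX are hypothetical (both model ids are Groq-hosted Whisper variants), and with_retries retries on any exception, whereas production code would usually retry only transient errors such as rate limits or 5xx responses. S3 invokes Lambda asynchronously, and Lambda already retries failed asynchronous invocations (twice by default), so in-function retries add to those.

```python
import random
import time

# Hypothetical folder-to-model routing; the longest matching prefix wins.
MODEL_BY_PREFIX = {
    "audio/": "whisper-large-v3",
    "audio/quick/": "whisper-large-v3-turbo",
}


def model_for(key: str) -> str:
    """Pick the transcription model for an uploaded key by its folder."""
    matches = [prefix for prefix in MODEL_BY_PREFIX if key.startswith(prefix)]
    if not matches:
        raise ValueError(f"no model configured for {key!r}")
    return MODEL_BY_PREFIX[max(matches, key=len)]


def with_retries(fn, attempts: int = 4, base_delay: float = 1.0,
                 max_delay: float = 30.0, sleep=time.sleep):
    """Call fn(); on failure, wait with exponential backoff plus full jitter, then retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Cap grows 1s, 2s, 4s, ... up to max_delay; sleep a random time below it.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

In the handler, the Groq call could then be wrapped as with_retries(lambda: groq_client.audio.transcriptions.create(..., model=model_for(event.key))).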

Security notes

  • Keep keys in environment variables.
  • Restrict S3 prefixes and IAM permissions.
  • Limit accepted input formats, for example with a suffix filter on the bucket's event notification.
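A least-privilege setup for this pipeline might look like the following sketch. audio-to-transcribe1 is the bucket name from the handler snippet; the extension allowlist is an example, and the policy is one possible scoping (read from audio/, write to transcriptions/, plus CloudWatch Logs), not the repo's actual configuration. With Chalice, a hand-written policy like this can be saved as .chalice/policy-dev.json and used by setting autogen_policy to false in .chalice/config.json.

```python
import json
import os

# Example allowlist of audio formats; adjust to what your recorder produces.
ALLOWED_EXTENSIONS = {".flac", ".m4a", ".mp3", ".mp4", ".ogg", ".wav", ".webm"}


def is_allowed_audio(key: str) -> bool:
    """Reject uploads whose extension is not on the allowlist (case-insensitive)."""
    return os.path.splitext(key)[1].lower() in ALLOWED_EXTENSIONS


def least_privilege_policy(bucket: str) -> dict:
    """IAM policy: read only from audio/, write only to transcriptions/, and log."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["s3:GetObject"],
             "Resource": [f"arn:aws:s3:::{bucket}/audio/*"]},
            {"Effect": "Allow", "Action": ["s3:PutObject"],
             "Resource": [f"arn:aws:s3:::{bucket}/transcriptions/*"]},
            {"Effect": "Allow",
             "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
             "Resource": "arn:aws:logs:*:*:*"},
        ],
    }


if __name__ == "__main__":
    print(json.dumps(least_privilege_policy("audio-to-transcribe1"), indent=2))
```

Checking is_allowed_audio(event.key) at the top of the handler lets unsupported files be skipped before any paid API call is made.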

In short

  • Key innovation: using S3 + Files as the UI, with Chalice to turn S3 events into a clean serverless pipeline.
  • Interesting elements: native-feel UX, Whisper+Gemini two-stage quality, tiny operational surface area, and low per-minute cost.
