daytonaio · Thanhdn1984 · May 30, 2026
diff --git a/guides/20260531_ai_transcription_tool_sapat.md b/guides/20260531_ai_transcription_tool_sapat.md
@@ -0,0 +1,287 @@
+---
+title: 'Build an AI Transcription Tool with Sapat'
+description:
+  'Use Sapat, Whisper-style speech-to-text APIs, and Daytona to build a reproducible AI transcription workflow.'
+date: 2026-05-31
+author: 'Thanhdn1984'
+tags: ['ai', 'transcription', 'daytona', 'openai', 'groq']
+---
+
+# Build an AI Transcription Tool with Sapat
+
+## Introduction
+
+Audio is one of the most common formats for meetings, interviews, lectures, and
+support calls, but it is awkward to search, summarize, or reuse until it becomes
+text. Modern speech-to-text APIs make transcription easier, yet each provider
+has different authentication, request formats, file limits, model names, and
+response shapes. A small wrapper can save a lot of repeated integration work.
+
+[Sapat](https://github.com/nkkko/sapat) is a small open-source project built
+around AI speech-to-text APIs. In this guide, you will use it as the
+base for an AI transcription tool that can accept an audio file, send it to a
+speech-to-text provider, and save a clean transcript. The examples focus on
+OpenAI and Groq-style APIs, but the same structure can be extended to other
+providers later.
+
+You will also run the project inside a reproducible Daytona workspace, which is
+useful when you want a clean environment for testing API clients without
+polluting your local machine.
+
+## TL;DR
+
+- Clone Sapat and open it in a Daytona workspace.
+- Store API keys in environment variables, not source code.
+- Add a small transcription command that reads an audio file and writes a
+  transcript.
+- Test the workflow with a short sample before processing large recordings.
+- Keep provider-specific code isolated so new APIs can be added safely.
+
+## Prerequisites
+
+Before starting, make sure you have:
+
+- [Daytona](https://www.daytona.io/docs/installation/installation/) installed.
+- Git installed.
+- A recent Node.js LTS release and npm in your workspace; the examples in this
+  guide are written in TypeScript.
+- An API key for your chosen speech-to-text provider.
+- A short audio file for testing, such as `sample.mp3` or `sample.wav`.
+
+Do not commit API keys, personal recordings, or generated transcripts that may
+contain private information.
+
+## Step 1: Create a Daytona Workspace
+
+Start by creating a clean workspace from the Sapat repository:
+
+```bash
+daytona create https://github.com/nkkko/sapat
+```
+
+Open the workspace in your editor:
+
+```bash
+daytona code sapat
+```
+
+Using Daytona gives you a disposable environment where dependencies, test audio,
+and provider SDKs can be installed without affecting your main development
+setup.
+
+## Step 2: Inspect the Project
+
+Inside the workspace, review the repository structure:
+
+```bash
+ls
+find . -maxdepth 2 -type f | sort
+```
+
+Look for the current entry point, package manager files, and any existing API
+client helpers. The goal is to avoid mixing transcription logic directly into
+unrelated files. A maintainable structure usually looks like this:
+
+```text
+src/
+  providers/
+    openai-transcription.ts
+    groq-transcription.ts
+  commands/
+    transcribe.ts
+```
+
+If Sapat already has a provider abstraction, reuse it. If not, create a thin
+interface that every transcription provider can implement:
+
+```ts
+export interface TranscriptionProvider {
+  transcribe(inputPath: string): Promise<string>;
+}
+```
+
+This keeps the command simple and makes it easier to add new providers later.
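+
+As a rough illustration of that interface, the sketch below adapts a plain
+function into an object that satisfies `TranscriptionProvider`. The
+`fromFunction` helper and the `echoProvider` stub are hypothetical names, not
+part of Sapat, and the stub returns a placeholder string instead of calling a
+real API:
+
+```ts
+export interface TranscriptionProvider {
+  transcribe(inputPath: string): Promise<string>;
+}
+
+// Hypothetical helper: adapts any (path) => Promise<string> function
+// to the TranscriptionProvider interface.
+export function fromFunction(
+  fn: (inputPath: string) => Promise<string>,
+): TranscriptionProvider {
+  return { transcribe: (inputPath) => fn(inputPath) };
+}
+
+// Stub provider for local testing; it never touches the network.
+export const echoProvider = fromFunction(
+  async (inputPath) => `transcript of ${inputPath}`,
+);
+```
+
+A stub like this lets you exercise the command-line wrapper and file handling
+before spending any API credits.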
+
+## Step 3: Configure API Keys Safely
+
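+Create a local environment file for development. If the repository provides a
+`.env.example` template, copy it; otherwise create an empty `.env` file by hand: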
+
+```bash
+cp .env.example .env
+```
+
+Add only the keys you need:
+
+```bash
+OPENAI_API_KEY=your_openai_key_here
+GROQ_API_KEY=your_groq_key_here
+```
+
+Make sure `.env` is ignored by Git:
+
+```bash
+grep -n "\.env" .gitignore
+```
+
+If the command prints nothing, `.env` is not ignored; add it to `.gitignore`
+before continuing. Credentials should be read from `process.env` or an
+equivalent runtime configuration mechanism.
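+
+One way to centralize that lookup is a small helper that fails fast when a
+variable is missing. This is a minimal sketch; `requireEnv` is a hypothetical
+name, and the second parameter exists only so the function can be tested
+without touching the real environment:
+
+```ts
+// Hypothetical helper: reads a required variable from an env-like object
+// and throws a clear error when it is missing or blank.
+export function requireEnv(
+  name: string,
+  env: Record<string, string | undefined> = process.env,
+): string {
+  const value = env[name]?.trim();
+  if (!value) {
+    throw new Error(`${name} is required but not set`);
+  }
+  return value;
+}
+```
+
+A provider module could then call `requireEnv('OPENAI_API_KEY')` once instead
+of repeating the same check in every function.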
+
+## Step 4: Add an OpenAI Transcription Provider
+
+Create a provider module that accepts a file path and returns plain text. The
+example uses the official `openai` npm package (`npm install openai`). SDK
+details change over time, so keep this code small and easy to update:
+
+```ts
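+import fs from 'node:fs';
+import OpenAI from 'openai';
+
+export async function transcribeWithOpenAI(inputPath: string): Promise<string> {
+  // Validate before constructing the client: the SDK constructor itself
+  // throws when no API key is available, which would hide this message.
+  const apiKey = process.env.OPENAI_API_KEY;
+  if (!apiKey) {
+    throw new Error('OPENAI_API_KEY is required');
+  }
+
+  const client = new OpenAI({ apiKey });
+
+  // Stream the audio from disk instead of loading it into memory.
+  const result = await client.audio.transcriptions.create({
+    file: fs.createReadStream(inputPath),
+    model: 'whisper-1',
+  });
+
+  // Return only the text so callers do not depend on the response shape.
+  return result.text ?? '';
+}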
+```
+
+The important parts are validating the key, streaming the file from disk, and
+returning a normalized string so the rest of the app does not depend on any
+provider's response format.
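+
+If different providers or response formats hand back different shapes, the
+normalization can live in one place. The sketch below assumes a response is
+either a bare string or an object with a `text` field; `normalizeTranscript` is
+a hypothetical helper, and any other shape is treated as an empty transcript:
+
+```ts
+// Hypothetical helper: accepts either a bare string or an object with a
+// `text` field and returns a trimmed transcript string.
+export function normalizeTranscript(response: unknown): string {
+  if (typeof response === 'string') {
+    return response.trim();
+  }
+  if (typeof response === 'object' && response !== null && 'text' in response) {
+    const text = (response as { text: unknown }).text;
+    return typeof text === 'string' ? text.trim() : '';
+  }
+  return '';
+}
+```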
+
+## Step 5: Add a Command-Line Wrapper
+
+Create a small command that accepts an input file and output path:
+
+```ts
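+import fs from 'node:fs/promises';
+import { transcribeWithOpenAI } from '../providers/openai-transcription';
+
+async function main(): Promise<void> {
+  const input = process.argv[2];
+  const output = process.argv[3] ?? 'transcript.md';
+
+  if (!input) {
+    console.error('Usage: npm run transcribe -- ./sample.mp3 transcript.md');
+    process.exit(1);
+  }
+
+  const transcript = await transcribeWithOpenAI(input);
+  await fs.writeFile(output, `${transcript}\n`, 'utf8');
+  console.log(`Transcript written to ${output}`);
+}
+
+// Wrapping the logic in main() avoids top-level await, which only works
+// when the project is configured for ES modules.
+main().catch((error) => {
+  console.error(error instanceof Error ? error.message : error);
+  process.exit(1);
+});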
+```
+
+Then expose it in `package.json`. The script assumes `tsx` is installed as a
+development dependency (`npm install --save-dev tsx`):
+
+```json
+{
+  "scripts": {
+    "transcribe": "tsx src/commands/transcribe.ts"
+  }
+}
+```
+
+Now the workflow is simple enough for repeat use:
+
+```bash
+npm run transcribe -- ./sample.mp3 transcript.md
+```
+
+## Step 6: Add Groq or Another Provider
+
+To support another API, create a second provider with the same return contract:
+
+```ts
+export async function transcribeWithGroq(inputPath: string): Promise<string> {
+  if (!process.env.GROQ_API_KEY) {
+    throw new Error('GROQ_API_KEY is required');
+  }
+
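+  // Call the provider's audio transcription endpoint here and return only
+  // the final transcript string. Until that exists, fail loudly instead of
+  // silently writing an empty transcript.
+  throw new Error('Groq transcription is not implemented yet');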
+}
+```
+
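+Then choose the provider with an environment variable or CLI flag. The command
+from Step 5 must read this value itself, because as written it always calls
+`transcribeWithOpenAI`: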
+
+```bash
+TRANSCRIPTION_PROVIDER=openai npm run transcribe -- ./sample.mp3 transcript.md
+```
+
+A provider switch is cleaner than duplicating commands for every API.
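+
+The switch itself might look like the sketch below. The registry, the
+`selectProvider` name, and the `openai` default are assumptions made for
+illustration; the stub functions stand in for `transcribeWithOpenAI` and
+`transcribeWithGroq` so the example runs on its own:
+
+```ts
+type Transcribe = (inputPath: string) => Promise<string>;
+
+// Hypothetical registry; in the real tool the values would be the
+// provider functions imported from src/providers/.
+const providers = new Map<string, Transcribe>([
+  ['openai', async (inputPath) => `openai:${inputPath}`],
+  ['groq', async (inputPath) => `groq:${inputPath}`],
+]);
+
+// Resolves a provider by name, defaulting to "openai" when unset or blank.
+export function selectProvider(name: string | undefined): Transcribe {
+  const key = (name?.trim() || 'openai').toLowerCase();
+  const provider = providers.get(key);
+  if (!provider) {
+    const known = [...providers.keys()].join(', ');
+    throw new Error(`Unknown provider "${key}". Expected one of: ${known}`);
+  }
+  return provider;
+}
+```
+
+In the command, `selectProvider(process.env.TRANSCRIPTION_PROVIDER)` would then
+replace the direct call to a single provider function.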
+
+## Step 7: Test with a Short Audio File
+
+Before using long recordings, test the tool with a small file:
+
+```bash
+npm run transcribe -- ./sample.mp3 transcript.md
+sed -n '1,40p' transcript.md
+```
+
+Check for:
+
+- Empty transcript output.
+- Authentication failures.
+- Unsupported audio formats.
+- Rate-limit or file-size errors.
+- Incorrect language detection.
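+
+Rate-limit errors in particular are often transient. A generic retry wrapper
+with exponential backoff is one way to handle them; the sketch below is a
+hypothetical `withRetry` helper that retries on every error, so a real tool
+would likely narrow it to rate-limit responses and let authentication failures
+surface immediately:
+
+```ts
+// Hypothetical helper: retries an async operation with exponential backoff.
+export async function withRetry<T>(
+  operation: () => Promise<T>,
+  attempts = 3,
+  baseDelayMs = 500,
+): Promise<T> {
+  let lastError: unknown;
+  for (let attempt = 0; attempt < attempts; attempt += 1) {
+    try {
+      return await operation();
+    } catch (error) {
+      lastError = error;
+      if (attempt < attempts - 1) {
+        // The delay doubles after each failed attempt.
+        const delayMs = baseDelayMs * 2 ** attempt;
+        await new Promise((resolve) => setTimeout(resolve, delayMs));
+      }
+    }
+  }
+  throw lastError;
+}
+```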
+
+Once the short test works, try a longer recording and measure cost and latency.
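+
+A minimal sketch of that measurement follows. Both helper names are
+hypothetical, and `estimateCostUsd` assumes the provider bills per minute of
+audio; the rate is passed in by the caller, so take it from your provider's
+current pricing page:
+
+```ts
+// Hypothetical helper: runs an async task and reports wall-clock time in ms.
+export async function timed<T>(
+  task: () => Promise<T>,
+): Promise<{ result: T; elapsedMs: number }> {
+  const started = Date.now();
+  const result = await task();
+  return { result, elapsedMs: Date.now() - started };
+}
+
+// Estimates cost from audio length and a per-minute rate supplied by the
+// caller; no real provider price is hard-coded here.
+export function estimateCostUsd(audioSeconds: number, usdPerMinute: number): number {
+  return (audioSeconds / 60) * usdPerMinute;
+}
+```
+
+Wrapping the provider call as `timed(() => transcribeWithOpenAI(input))` keeps
+the measurement out of the provider modules.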
+
+## Step 8: Improve the Transcript
+
+Raw transcripts are useful, but most teams need additional cleanup. Common next
+steps include:
+
+- Speaker labels for interviews or calls.
+- Timestamps for subtitles and review.
+- Markdown formatting for notes.
+- Automatic summaries.
+- Keyword extraction.
+- Export to `.srt`, `.vtt`, or `.docx`.
+
+Keep these post-processing steps separate from the provider call. That way, the
+same cleanup pipeline can be reused with OpenAI, Groq, or any future provider.
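+
+As one example of such a step, the sketch below turns timestamped segments into
+SubRip (`.srt`) text. The `Segment` shape is an assumption: it presumes you
+already have start and end times in seconds for each piece of text, which
+depends on the provider and the response format you request:
+
+```ts
+export interface Segment {
+  start: number; // seconds
+  end: number; // seconds
+  text: string;
+}
+
+// Formats seconds as the SRT timestamp HH:MM:SS,mmm.
+function toSrtTime(seconds: number): string {
+  const totalMs = Math.round(seconds * 1000);
+  const ms = totalMs % 1000;
+  const totalSeconds = Math.floor(totalMs / 1000);
+  const s = totalSeconds % 60;
+  const m = Math.floor(totalSeconds / 60) % 60;
+  const h = Math.floor(totalSeconds / 3600);
+  const pad = (value: number, width = 2) => String(value).padStart(width, '0');
+  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
+}
+
+// Builds an SRT document: numbered cues separated by blank lines.
+export function toSrt(segments: Segment[]): string {
+  return segments
+    .map((segment, index) =>
+      [
+        String(index + 1),
+        `${toSrtTime(segment.start)} --> ${toSrtTime(segment.end)}`,
+        segment.text.trim(),
+      ].join('\n'),
+    )
+    .join('\n\n');
+}
+```
+
+Because the function only sees plain segments, it works the same way no matter
+which provider produced them.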
+
+## Troubleshooting
+
+### The API key is not detected
+
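+Confirm the variable is set in the same shell where you run the command. The
+check below reports whether it exists without printing the secret itself:
+
+```bash
+[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY is set" || echo "OPENAI_API_KEY is missing"
+```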
+
+If you use a `.env` file, load it explicitly: Node.js 20.6 and later accept
+`node --env-file=.env`, and older setups can use a package such as `dotenv`.
+
+### The file format is rejected
+
+Convert the audio to a common format such as MP3 or WAV:
+
+```bash
+ffmpeg -i input.m4a sample.mp3
+```
+
+### Large files fail
+
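+Providers cap the size of a single upload (OpenAI, for example, has documented
+a 25 MB limit for its transcription endpoint; check your provider's current
+limits). Split long audio into smaller chunks before sending it to the
+provider. This also makes retries cheaper when a single request fails.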
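+
+Planning the chunk boundaries is simple arithmetic. The sketch below assumes
+you already know the recording's total duration in seconds; `planChunks` and
+the `Chunk` shape are hypothetical names, and the function only computes
+offsets without touching the audio file:
+
+```ts
+export interface Chunk {
+  index: number;
+  startSeconds: number;
+  durationSeconds: number;
+}
+
+// Splits a recording of totalSeconds into consecutive chunks of at most
+// chunkSeconds each; the last chunk holds the remainder.
+export function planChunks(totalSeconds: number, chunkSeconds: number): Chunk[] {
+  if (totalSeconds <= 0 || chunkSeconds <= 0) {
+    throw new Error('totalSeconds and chunkSeconds must be positive');
+  }
+  const chunks: Chunk[] = [];
+  let index = 0;
+  for (let start = 0; start < totalSeconds; start += chunkSeconds) {
+    chunks.push({
+      index,
+      startSeconds: start,
+      durationSeconds: Math.min(chunkSeconds, totalSeconds - start),
+    });
+    index += 1;
+  }
+  return chunks;
+}
+```
+
+Each planned chunk maps onto one `ffmpeg` call that uses `-ss` for the start
+offset and `-t` for the duration, and the resulting transcripts can be joined
+in `index` order.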
+
+## Conclusion
+
+Sapat can be used as a practical base for a small AI transcription workflow. By
+running it inside Daytona, keeping credentials in environment variables, and
+isolating provider-specific code, you get a setup that is reproducible, safe,
+and easy to extend. Start with one provider, test with short audio, then add
+more APIs or transcript cleanup steps as your workflow grows.