From bd7247f3709d7bd4fd567840d3dd4f256eb608ec Mon Sep 17 00:00:00 2001
From: Thanh Nguyen
Date: Sun, 31 May 2026 05:18:36 +0700
Subject: [PATCH] Add AI transcription tool guide

---
 .../20260531_ai_transcription_tool_sapat.md   | 287 ++++++++++++++++++
 1 file changed, 287 insertions(+)
 create mode 100644 guides/20260531_ai_transcription_tool_sapat.md

diff --git a/guides/20260531_ai_transcription_tool_sapat.md b/guides/20260531_ai_transcription_tool_sapat.md
new file mode 100644
index 00000000..6c978592
--- /dev/null
+++ b/guides/20260531_ai_transcription_tool_sapat.md
@@ -0,0 +1,287 @@
+---
+title: 'Build an AI Transcription Tool with Sapat'
+description:
+  'Use Sapat, Whisper-style speech-to-text APIs, and Daytona to build a reproducible AI transcription workflow.'
+date: 2026-05-31
+author: 'Thanhdn1984'
+tags: ['ai', 'transcription', 'daytona', 'openai', 'groq']
+---
+
+# Build an AI Transcription Tool with Sapat
+
+## Introduction
+
+Audio is one of the most common formats for meetings, interviews, lectures, and
+support calls, but it is awkward to search, summarize, or reuse until it becomes
+text. Modern speech-to-text APIs make transcription easier, yet each provider
+has different authentication, request formats, file limits, model names, and
+response shapes. A small wrapper can save a lot of repeated integration work.
+
+[Sapat](https://github.com/nkkko/sapat) is a lightweight project for calling AI
+APIs from a simple developer workflow. In this guide, you will use it as the
+base for an AI transcription tool that can accept an audio file, send it to a
+speech-to-text provider, and save a clean transcript. The examples focus on
+OpenAI and Groq-style APIs, but the same structure can be extended to other
+providers later.
+
+You will also run the project inside a reproducible Daytona workspace, which is
+useful when you want a clean environment for testing API clients without
+polluting your local machine.
+
+## TL;DR
+
+- Clone Sapat and open it in a Daytona workspace.
+- Store API keys in environment variables, not source code.
+- Add a small transcription command that reads an audio file and writes a
+  transcript.
+- Test the workflow with a short sample before processing large recordings.
+- Keep provider-specific code isolated so new APIs can be added safely.
+
+## Prerequisites
+
+Before starting, make sure you have:
+
+- [Daytona](https://www.daytona.io/docs/installation/installation/) installed.
+- Git installed.
+- Node.js available in your workspace if the project uses JavaScript tooling.
+- An API key for your chosen speech-to-text provider.
+- A short audio file for testing, such as `sample.mp3` or `sample.wav`.
+
+Do not commit API keys, personal recordings, or generated transcripts that may
+contain private information.
+
+## Step 1: Create a Daytona Workspace
+
+Start by creating a clean workspace from the Sapat repository:
+
+```bash
+daytona create https://github.com/nkkko/sapat
+```
+
+Open the workspace in your editor:
+
+```bash
+daytona code sapat
+```
+
+Using Daytona gives you a disposable environment where dependencies, test audio,
+and provider SDKs can be installed without affecting your main development
+setup.
+
+## Step 2: Inspect the Project
+
+Inside the workspace, review the repository structure:
+
+```bash
+ls
+find . -maxdepth 2 -type f | sort
+```
+
+Look for the current entry point, package manager files, and any existing API
+client helpers. The goal is to avoid mixing transcription logic directly into
+unrelated files. A maintainable structure usually looks like this:
+
+```text
+src/
+  providers/
+    openai-transcription.ts
+    groq-transcription.ts
+  commands/
+    transcribe.ts
+```
+
+If Sapat already has a provider abstraction, reuse it.
If not, create a thin
+interface that every transcription provider can implement:
+
+```ts
+export interface TranscriptionProvider {
+  transcribe(inputPath: string): Promise<string>;
+}
+```
+
+This keeps the command simple and makes it easier to add new providers later.
+
+## Step 3: Configure API Keys Safely
+
+Create a local environment file for development:
+
+```bash
+cp .env.example .env
+```
+
+Add only the keys you need:
+
+```bash
+OPENAI_API_KEY=your_openai_key_here
+GROQ_API_KEY=your_groq_key_here
+```
+
+Make sure `.env` is ignored by Git:
+
+```bash
+grep -n "\.env" .gitignore
+```
+
+If the file is not ignored, add it before continuing. Credentials should be read
+from `process.env` or an equivalent runtime configuration mechanism.
+
+## Step 4: Add an OpenAI Transcription Provider
+
+Create a provider module that accepts a file path and returns plain text. The
+exact SDK may change over time, so keep this code small and easy to update:
+
+```ts
+import fs from 'node:fs';
+import OpenAI from 'openai';
+
+const client = new OpenAI({
+  apiKey: process.env.OPENAI_API_KEY,
+});
+
+export async function transcribeWithOpenAI(inputPath: string): Promise<string> {
+  if (!process.env.OPENAI_API_KEY) {
+    throw new Error('OPENAI_API_KEY is required');
+  }
+
+  const result = await client.audio.transcriptions.create({
+    file: fs.createReadStream(inputPath),
+    model: 'whisper-1',
+  });
+
+  return result.text ?? '';
+}
+```
+
+The important parts are validation, streaming the file from disk, and returning
+a normalized string so the rest of the app does not depend on the provider's
+response format.
+
+## Step 5: Add a Command-Line Wrapper
+
+Create a small command that accepts an input file and an output path:
+
+```ts
+import fs from 'node:fs/promises';
+import { transcribeWithOpenAI } from '../providers/openai-transcription';
+
+const input = process.argv[2];
+const output = process.argv[3] ??
'transcript.md';
+
+if (!input) {
+  console.error('Usage: npm run transcribe -- ./sample.mp3 transcript.md');
+  process.exit(1);
+}
+
+const transcript = await transcribeWithOpenAI(input);
+await fs.writeFile(output, `${transcript}\n`, 'utf8');
+
+console.log(`Transcript written to ${output}`);
+```
+
+Then expose it in `package.json`:
+
+```json
+{
+  "scripts": {
+    "transcribe": "tsx src/commands/transcribe.ts"
+  }
+}
+```
+
+Now the workflow is simple enough for repeated use:
+
+```bash
+npm run transcribe -- ./sample.mp3 transcript.md
+```
+
+## Step 6: Add Groq or Another Provider
+
+To support another API, create a second provider with the same return contract:
+
+```ts
+export async function transcribeWithGroq(inputPath: string): Promise<string> {
+  if (!process.env.GROQ_API_KEY) {
+    throw new Error('GROQ_API_KEY is required');
+  }
+
+  // Call the provider's audio transcription endpoint here.
+  // Return only the final transcript string.
+  return '';
+}
+```
+
+Then choose the provider with an environment variable or CLI flag:
+
+```bash
+TRANSCRIPTION_PROVIDER=openai npm run transcribe -- ./sample.mp3 transcript.md
+```
+
+A provider switch is cleaner than duplicating commands for every API.
+
+## Step 7: Test with a Short Audio File
+
+Before using long recordings, test the tool with a small file:
+
+```bash
+npm run transcribe -- ./sample.mp3 transcript.md
+sed -n '1,40p' transcript.md
+```
+
+Check for:
+
+- Empty transcript output.
+- Authentication failures.
+- Unsupported audio formats.
+- Rate-limit or file-size errors.
+- Incorrect language detection.
+
+Once the short test works, try a longer recording and measure cost and latency.
+
+## Step 8: Improve the Transcript
+
+Raw transcripts are useful, but most teams need additional cleanup. Common next
+steps include:
+
+- Speaker labels for interviews or calls.
+- Timestamps for subtitles and review.
+- Markdown formatting for notes.
+- Automatic summaries.
+- Keyword extraction.
+- Export to `.srt`, `.vtt`, or `.docx`.
+
+Keep these post-processing steps separate from the provider call. That way, the
+same cleanup pipeline can be reused with OpenAI, Groq, or any future provider.
+
+## Troubleshooting
+
+### The API key is not detected
+
+Confirm the variable is loaded in the same shell where you run the command:
+
+```bash
+echo $OPENAI_API_KEY
+```
+
+If you use a `.env` file, load it with your runtime or a package such as
+`dotenv`.
+
+### The file format is rejected
+
+Convert the audio to a common format such as MP3 or WAV:
+
+```bash
+ffmpeg -i input.m4a sample.mp3
+```
+
+### Large files fail
+
+Split long audio into smaller chunks before sending it to the provider. This
+also makes retries cheaper when a single request fails.
+
+## Conclusion
+
+Sapat can be used as a practical base for a small AI transcription workflow. By
+running it inside Daytona, keeping credentials in environment variables, and
+isolating provider-specific code, you get a setup that is reproducible, safe,
+and easy to extend. Start with one provider, test with short audio, then add
+more APIs or transcript cleanup steps as your workflow grows.
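
Note on Step 6: the guide sets `TRANSCRIPTION_PROVIDER` on the command line, but the Step 5 command always calls `transcribeWithOpenAI` and never reads that variable. One way the switch might be wired is sketched below. This is only an illustration under stated assumptions: `Transcribe`, `providers`, and `selectProvider` are hypothetical names that appear in neither the patch nor Sapat, and the two provider functions are stubs standing in for the real modules from Steps 4 and 6.

```typescript
// Hypothetical provider switch; every name here is illustrative.
type Transcribe = (inputPath: string) => Promise<string>;

// Stubs standing in for the real provider modules from Steps 4 and 6.
const transcribeWithOpenAI: Transcribe = async (inputPath) => `openai:${inputPath}`;
const transcribeWithGroq: Transcribe = async (inputPath) => `groq:${inputPath}`;

// A Map avoids accidental hits on inherited object keys such as "constructor".
const providers = new Map<string, Transcribe>([
  ['openai', transcribeWithOpenAI],
  ['groq', transcribeWithGroq],
]);

// Resolve a provider by name, falling back to "openai" when no name is given.
function selectProvider(name: string | undefined): Transcribe {
  const key = (name ?? 'openai').trim().toLowerCase();
  const provider = providers.get(key);
  if (!provider) {
    const known = Array.from(providers.keys()).join(', ');
    throw new Error(`Unknown transcription provider "${key}". Expected one of: ${known}`);
  }
  return provider;
}
```

In the Step 5 command, the direct call could then become `const transcribe = selectProvider(process.env.TRANSCRIPTION_PROVIDER);` followed by `await transcribe(input)`, so an unknown provider name fails fast with a readable message instead of silently using the default.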