287 changes: 287 additions & 0 deletions guides/20260531_ai_transcription_tool_sapat.md
@@ -0,0 +1,287 @@
---
title: 'Build an AI Transcription Tool with Sapat'
description:
  'Use Sapat, Whisper-style speech-to-text APIs, and Daytona to build a reproducible AI transcription workflow.'
date: 2026-05-31
author: 'Thanhdn1984'
tags: ['ai', 'transcription', 'daytona', 'openai', 'groq']
---

# Build an AI Transcription Tool with Sapat

## Introduction

Audio is one of the most common formats for meetings, interviews, lectures, and
support calls, but it is awkward to search, summarize, or reuse until it becomes
text. Modern speech-to-text APIs make transcription easier, yet each provider
has different authentication, request formats, file limits, model names, and
response shapes. A small wrapper can save a lot of repeated integration work.

[Sapat](https://github.com/nkkko/sapat) is a lightweight project for calling AI
APIs from a simple developer workflow. In this guide, you will use it as the
base for an AI transcription tool that can accept an audio file, send it to a
speech-to-text provider, and save a clean transcript. The examples focus on
OpenAI and Groq-style APIs, but the same structure can be extended to other
providers later.

You will also run the project inside a reproducible Daytona workspace, which is
useful when you want a clean environment for testing API clients without
polluting your local machine.

## TL;DR

- Clone Sapat and open it in a Daytona workspace.
- Store API keys in environment variables, not source code.
- Add a small transcription command that reads an audio file and writes a
transcript.
- Test the workflow with a short sample before processing large recordings.
- Keep provider-specific code isolated so new APIs can be added safely.

## Prerequisites

Before starting, make sure you have:

- [Daytona](https://www.daytona.io/docs/installation/installation/) installed.
- Git installed.
- Node.js available in your workspace if the project uses JavaScript tooling.
- An API key for your chosen speech-to-text provider.
- A short audio file for testing, such as `sample.mp3` or `sample.wav`.

Do not commit API keys, personal recordings, or generated transcripts that may
contain private information.

## Step 1: Create a Daytona Workspace

Start by creating a clean workspace from the Sapat repository:

```bash
daytona create https://github.com/nkkko/sapat
```

Open the workspace in your editor:

```bash
daytona code sapat
```

Using Daytona gives you a disposable environment where dependencies, test audio,
and provider SDKs can be installed without affecting your main development
setup.

## Step 2: Inspect the Project

Inside the workspace, review the repository structure:

```bash
ls
find . -maxdepth 2 -type f | sort
```

Look for the current entry point, package manager files, and any existing API
client helpers. The goal is to avoid mixing transcription logic directly into
unrelated files. A maintainable structure usually looks like this:

```text
src/
  providers/
    openai-transcription.ts
    groq-transcription.ts
  commands/
    transcribe.ts
```

If Sapat already has a provider abstraction, reuse it. If not, create a thin
interface that every transcription provider can implement:

```ts
export interface TranscriptionProvider {
  transcribe(inputPath: string): Promise<string>;
}
```

This keeps the command simple and makes it easier to add new providers later.
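
To illustrate how the interface decouples the command from any one API, here is a minimal sketch with a stand-in provider. `createFakeProvider` and `runTranscription` are hypothetical names used only for this example; they are not part of Sapat or any SDK.

```ts
// Restated from the interface above so this sketch is self-contained.
interface TranscriptionProvider {
  transcribe(inputPath: string): Promise<string>;
}

// Hypothetical stand-in provider: returns canned text instead of calling an API.
function createFakeProvider(cannedText: string): TranscriptionProvider {
  return {
    async transcribe(inputPath: string): Promise<string> {
      return `[${inputPath}] ${cannedText}\n`;
    },
  };
}

// Hypothetical helper: the command depends only on the interface,
// so any provider that implements it can be passed in.
async function runTranscription(
  provider: TranscriptionProvider,
  inputPath: string,
): Promise<string> {
  const text = await provider.transcribe(inputPath);
  return text.trim();
}
```

A stand-in like this is also handy for testing the command without spending API credits.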

## Step 3: Configure API Keys Safely

Create a local environment file for development:

```bash
cp .env.example .env
```

Add only the keys you need:

```bash
OPENAI_API_KEY=your_openai_key_here
GROQ_API_KEY=your_groq_key_here
```

Make sure `.env` is ignored by Git:

```bash
grep -n "\.env" .gitignore
```

If the file is not ignored, add it before continuing. Credentials should be read
from `process.env` or an equivalent runtime configuration mechanism.
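
One way to centralize that lookup is a small helper that fails with a clear message when a variable is missing. This is a sketch, not part of Sapat: `requireEnv` is a hypothetical name, and the environment is passed in as a parameter so the function can be tested without real secrets.

```ts
// Hypothetical helper: look up a required variable and fail with a clear message.
// `env` is passed in (normally process.env) so tests can supply fake values.
export function requireEnv(
  name: string,
  env: Record<string, string | undefined>,
): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`${name} is required. Set it in your shell or .env file.`);
  }
  return value;
}
```

In a provider module, the call would look like `const apiKey = requireEnv('OPENAI_API_KEY', process.env);`.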

## Step 4: Add an OpenAI Transcription Provider

Create a provider module that accepts a file path and returns plain text. The
exact SDK may change over time, so keep this code small and easy to update:

```ts
import fs from 'node:fs';
import OpenAI from 'openai';

export async function transcribeWithOpenAI(inputPath: string): Promise<string> {
  const apiKey = process.env.OPENAI_API_KEY;
  if (!apiKey) {
    throw new Error('OPENAI_API_KEY is required');
  }

  // Create the client after validation: the SDK constructor itself throws on
  // a missing key, which would otherwise pre-empt the clearer error above.
  const client = new OpenAI({ apiKey });

  // Stream the file from disk instead of loading it into memory.
  const result = await client.audio.transcriptions.create({
    file: fs.createReadStream(inputPath),
    model: 'whisper-1',
  });

  return result.text ?? '';
}
```

The important parts are validating the API key before the request, streaming
the file from disk, and returning a normalized string so the rest of the app
does not depend on the provider's response format.
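
The normalization step can be sketched as a small function. The two input shapes handled here (a plain string, or an object with a `text` field) are assumptions for illustration, and `normalizeTranscript` is a hypothetical name; adjust it to whatever your provider actually returns.

```ts
// Hypothetical normalizer: accepts a plain string or an object with a `text`
// field and always returns a trimmed string. Anything else becomes ''.
export function normalizeTranscript(response: unknown): string {
  if (typeof response === 'string') {
    return response.trim();
  }
  if (typeof response === 'object' && response !== null && 'text' in response) {
    const text = (response as { text?: unknown }).text;
    return typeof text === 'string' ? text.trim() : '';
  }
  return '';
}
```

With a helper like this, each provider can end with `return normalizeTranscript(result);` and the command never inspects provider-specific fields.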

## Step 5: Add a Command-Line Wrapper

Create a small command that accepts an input file and output path:

```ts
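import fs from 'node:fs/promises';
import { transcribeWithOpenAI } from '../providers/openai-transcription';

async function main(): Promise<void> {
  const input = process.argv[2];
  const output = process.argv[3] ?? 'transcript.md';

  if (!input) {
    console.error('Usage: npm run transcribe -- <input-audio> [output-file]');
    process.exit(1);
  }

  const transcript = await transcribeWithOpenAI(input);
  await fs.writeFile(output, `${transcript}\n`, 'utf8');

  console.log(`Transcript written to ${output}`);
}

// Wrapping the logic in main() avoids top-level await, which only works in
// ES modules, and lets failures exit with a non-zero status.
main().catch((error) => {
  console.error(error instanceof Error ? error.message : error);
  process.exit(1);
});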
```

Then expose it in `package.json`:

```json
{
  "scripts": {
    "transcribe": "tsx src/commands/transcribe.ts"
  }
}
```

Now the workflow is simple enough for repeat use:

```bash
npm run transcribe -- ./sample.mp3 transcript.md
```

## Step 6: Add Groq or Another Provider

To support another API, create a second provider with the same return contract:

```ts
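export async function transcribeWithGroq(inputPath: string): Promise<string> {
  if (!process.env.GROQ_API_KEY) {
    throw new Error('GROQ_API_KEY is required');
  }

  // Call the provider's audio transcription endpoint here and return only
  // the final transcript string. Until that call exists, fail loudly rather
  // than silently writing an empty transcript.
  throw new Error(`Groq transcription is not implemented yet (input: ${inputPath})`);
}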
```

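Then choose the provider with an environment variable or CLI flag. The command
has to read that setting itself; the Step 5 wrapper calls OpenAI directly until
you add the lookup: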

```bash
TRANSCRIPTION_PROVIDER=openai npm run transcribe -- ./sample.mp3 transcript.md
```

A provider switch is cleaner than duplicating commands for every API.
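
A provider switch along these lines could be sketched as a lookup in a small registry. The names `Transcribe` and `selectProvider` are hypothetical, and the default of `openai` is an assumption; in the real command the registry entries would be `transcribeWithOpenAI` and `transcribeWithGroq`.

```ts
type Transcribe = (inputPath: string) => Promise<string>;

// Hypothetical selector: resolves a provider name (for example the value of
// TRANSCRIPTION_PROVIDER) to a function, falling back to a default when the
// name is unset or blank.
export function selectProvider(
  providers: Map<string, Transcribe>,
  requested: string | undefined,
  fallback = 'openai',
): Transcribe {
  const name = (requested?.trim() || fallback).toLowerCase();
  const provider = providers.get(name);
  if (!provider) {
    const known = Array.from(providers.keys()).join(', ');
    throw new Error(`Unknown provider "${name}". Available: ${known}`);
  }
  return provider;
}
```

The command would then call `selectProvider(registry, process.env.TRANSCRIPTION_PROVIDER)` and invoke the returned function with the input path. An unknown name fails immediately with the list of valid options instead of falling through to the wrong API.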

## Step 7: Test with a Short Audio File

Before using long recordings, test the tool with a small file:

```bash
npm run transcribe -- ./sample.mp3 transcript.md
sed -n '1,40p' transcript.md
```

Check for:

- Empty transcript output.
- Authentication failures.
- Unsupported audio formats.
- Rate-limit or file-size errors.
- Incorrect language detection.

Once the short test works, try a longer recording and measure cost and latency.
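
Two items on the checklist above, unsupported formats and file-size errors, can be caught before any API call is made. The sketch below is illustrative only: `preflightProblems` is a hypothetical name, and the extension list and 25 MB cap are assumed values that should be checked against your provider's documentation.

```ts
// Assumed values for illustration; confirm them for your provider.
const SUPPORTED_EXTENSIONS = ['.mp3', '.wav', '.m4a', '.webm'];
const MAX_UPLOAD_BYTES = 25 * 1024 * 1024;

// Hypothetical pre-flight check: returns a list of problems, empty if none.
export function preflightProblems(fileName: string, sizeBytes: number): string[] {
  const problems: string[] = [];

  const dot = fileName.lastIndexOf('.');
  const extension = dot >= 0 ? fileName.slice(dot).toLowerCase() : '';
  if (!SUPPORTED_EXTENSIONS.includes(extension)) {
    problems.push(`Unsupported audio format: ${extension || '(none)'}`);
  }

  if (sizeBytes === 0) {
    problems.push('File is empty');
  } else if (sizeBytes > MAX_UPLOAD_BYTES) {
    problems.push(`File exceeds ${MAX_UPLOAD_BYTES} bytes`);
  }

  return problems;
}
```

The size can come from `fs.stat(input)` in the command; printing the returned problems and exiting early saves a failed request.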

## Step 8: Improve the Transcript

Raw transcripts are useful, but most teams need additional cleanup. Common next
steps include:

- Speaker labels for interviews or calls.
- Timestamps for subtitles and review.
- Markdown formatting for notes.
- Automatic summaries.
- Keyword extraction.
- Export to `.srt`, `.vtt`, or `.docx`.

Keep these post-processing steps separate from the provider call. That way, the
same cleanup pipeline can be reused with OpenAI, Groq, or any future provider.
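
As an example of a provider-independent post-processing step, here is a sketch that turns timestamped segments into `.srt` text. The `Segment` shape (start and end in seconds plus text) is an assumption for illustration; map your provider's timestamped output into it first. `toSrtTimestamp` and `toSrt` are hypothetical names.

```ts
// Assumed segment shape: times are in seconds.
interface Segment {
  start: number;
  end: number;
  text: string;
}

// Formats seconds as the SRT timestamp HH:MM:SS,mmm.
function toSrtTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  const pad = (value: number, width: number) => String(value).padStart(width, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Builds numbered SRT cues separated by blank lines.
export function toSrt(segments: Segment[]): string {
  return segments
    .map(
      (segment, i) =>
        `${i + 1}\n${toSrtTimestamp(segment.start)} --> ${toSrtTimestamp(segment.end)}\n${segment.text.trim()}\n`,
    )
    .join('\n');
}
```

Because the function only sees `Segment` objects, it works unchanged whichever provider produced the timestamps.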

## Troubleshooting

### The API key is not detected

Confirm the variable is loaded in the same shell where you run the command:

```bash
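# Reports whether the key is set without printing the secret itself.
[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY is set" || echo "OPENAI_API_KEY is missing"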
```

If you use a `.env` file, load it with your runtime (Node.js 20.6 and later
accept `--env-file=.env`) or a package such as `dotenv`.

### The file format is rejected

Convert the audio to a common format such as MP3 or WAV:

```bash
ffmpeg -i input.m4a sample.mp3
```

### Large files fail

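Providers cap the size of a single upload; OpenAI's transcription endpoint, for
example, documents a 25 MB limit. Split long audio into smaller chunks before
sending it to the provider. This also makes retries cheaper, because only the
chunk that failed has to be sent again.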
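
The chunk boundaries can be planned with a small pure function before any audio is cut. This is a sketch under stated assumptions: `planChunks` is a hypothetical name, the total duration is assumed to be known already (for example from `ffprobe`), and chunks are cut at fixed intervals without regard to pauses in speech.

```ts
interface Chunk {
  index: number;
  startSeconds: number;
  durationSeconds: number;
}

// Hypothetical planner: splits a recording of `totalSeconds` into
// consecutive chunks of at most `chunkSeconds` each.
export function planChunks(totalSeconds: number, chunkSeconds: number): Chunk[] {
  if (!(totalSeconds > 0) || !(chunkSeconds > 0)) {
    throw new Error('totalSeconds and chunkSeconds must be positive numbers');
  }

  const chunks: Chunk[] = [];
  for (let start = 0, index = 0; start < totalSeconds; start += chunkSeconds, index++) {
    chunks.push({
      index,
      startSeconds: start,
      // The last chunk may be shorter than the others.
      durationSeconds: Math.min(chunkSeconds, totalSeconds - start),
    });
  }
  return chunks;
}
```

Each planned chunk maps onto one `ffmpeg` invocation using `-ss` for the start and `-t` for the duration. Fixed cuts can split a word in half, so a small overlap between chunks may be worth adding if accuracy at the boundaries matters.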

## Conclusion

Sapat can be used as a practical base for a small AI transcription workflow. By
running it inside Daytona, keeping credentials in environment variables, and
isolating provider-specific code, you get a setup that is reproducible, safe,
and easy to extend. Start with one provider, test with short audio, then add
more APIs or transcript cleanup steps as your workflow grows.