Commit 9e2f01c

Merge pull request #2 from OpenScience-Collective/docs/osa-knowledge-sync
docs: add OSA knowledge sync documentation
2 parents 3189132 + b08afb0 commit 9e2f01c

4 files changed

Lines changed: 655 additions & 0 deletions

docs/osa/cli-reference.md

Lines changed: 40 additions & 0 deletions
@@ -79,6 +79,39 @@ uv run osa config set api_url http://localhost:38528
Configuration is stored in `~/.config/osa/config.yaml`.

### `osa sync`

Sync knowledge sources (GitHub issues/PRs, academic papers). See [Knowledge Sync](knowledge-sync.md) for details.

```bash
# Initialize database
uv run osa sync init

# Sync GitHub issues/PRs from HED repos
uv run osa sync github

# Sync academic papers
uv run osa sync papers

# Sync everything
uv run osa sync all

# Check status
uv run osa sync status

# Search (for testing)
uv run osa sync search "validation error"
```

Options for `sync github`:

- `-r, --repo`: Specific repo to sync (e.g., `hed-standard/hed-specification`)

Options for `sync papers`:

- `-s, --source`: Paper source (`openalex`, `semanticscholar`, `pubmed`)
- `-q, --query`: Custom search query

## Configuration

### Environment Variables
@@ -89,6 +122,13 @@ Configuration is stored in `~/.config/osa/config.yaml`.
| `OPENROUTER_API_KEY` | LLM provider API key | Required |
| `LANGFUSE_PUBLIC_KEY` | LangFuse public key | Optional |
| `LANGFUSE_SECRET_KEY` | LangFuse secret key | Optional |
| `SYNC_ENABLED` | Enable automated knowledge sync | `true` |
| `SYNC_GITHUB_CRON` | GitHub sync schedule (cron) | `0 2 * * *` |
| `SYNC_PAPERS_CRON` | Papers sync schedule (cron) | `0 3 * * 0` |
| `GITHUB_TOKEN` | GitHub token for sync | Optional |
| `SEMANTIC_SCHOLAR_API_KEY` | Semantic Scholar API key | Optional |
| `PUBMED_API_KEY` | PubMed/NCBI API key | Optional |
| `DATA_DIR` | Data directory for knowledge DB | Platform-specific |

### Config File

docs/osa/knowledge-sync.md

Lines changed: 261 additions & 0 deletions
@@ -0,0 +1,261 @@
# Knowledge Sync

OSA includes a knowledge discovery system that syncs GitHub issues, pull requests, and academic papers related to HED. The assistant uses this index to point users to relevant discussions and research; it does not treat these items as authoritative knowledge sources.

## Overview

The knowledge database stores:

- **GitHub Issues and PRs** from HED repositories (hed-specification, hed-schemas, hed-javascript)
- **Academic Papers** from OpenAlex, Semantic Scholar, and PubMed

!!! note "Discovery, Not Answers"
    The knowledge system is for **discovery only**. The assistant links users to relevant discussions ("There's a related issue: [link]") rather than answering from them.

## Quick Start

```bash
# Initialize the database
uv run osa sync init

# Sync GitHub issues/PRs
uv run osa sync github

# Sync academic papers
uv run osa sync papers

# Sync everything
uv run osa sync all

# Check sync status
uv run osa sync status
```

## CLI Commands

### `osa sync init`

Initialize the knowledge database. Creates the SQLite database with FTS5 full-text search support.

```bash
uv run osa sync init
```

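To make the FTS5 part concrete, here is a minimal sketch of an SQLite full-text table and a phrase query against it. The table name `items`, its columns, and the sample rows are hypothetical and only illustrate the technique; the schema that `osa sync init` actually creates may differ. It also assumes your Python build ships SQLite with FTS5 enabled, which is common but not guaranteed.

```python
import sqlite3

# Hypothetical schema for illustration; not taken from the OSA source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE items USING fts5(title, body, url UNINDEXED)")
conn.executemany(
    "INSERT INTO items (title, body, url) VALUES (?, ?, ?)",
    [
        ("Validation error on nested groups",
         "The validator rejects nested tag groups.",
         "https://example.org/1"),
        ("Add schema changelog",
         "Document schema changes per release.",
         "https://example.org/2"),
    ],
)


def search(query: str, limit: int = 5) -> list[tuple[str, str]]:
    """Return (title, url) pairs ordered by FTS5's built-in relevance rank."""
    rows = conn.execute(
        "SELECT title, url FROM items WHERE items MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


# Double quotes inside the query string request an exact phrase match.
results = search('"validation error"')
print(results)
```

Keeping the URL column `UNINDEXED` stores the link alongside each row without making it searchable, which fits a discovery-only index that returns links rather than answers.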
### `osa sync github`

Sync GitHub issues and PRs from HED repositories.

```bash
# Sync all HED repos
uv run osa sync github

# Sync specific repo
uv run osa sync github -r hed-standard/hed-specification
```

Options:

- `-r, --repo`: Specific repository to sync (e.g., `hed-standard/hed-specification`)

**Synced repositories:**

- `hed-standard/hed-specification`
- `hed-standard/hed-schemas`
- `hed-standard/hed-javascript`

### `osa sync papers`

Sync academic papers from multiple sources.

```bash
# Sync from all sources
uv run osa sync papers

# Sync from specific source
uv run osa sync papers -s openalex
uv run osa sync papers -s semanticscholar
uv run osa sync papers -s pubmed

# Custom search query
uv run osa sync papers -q "BIDS event annotation"
```

Options:

- `-s, --source`: Paper source (`openalex`, `semanticscholar`, `pubmed`)
- `-q, --query`: Custom search query (default: `"HED annotation" OR "Hierarchical Event Descriptors"`)

### `osa sync all`

Sync all knowledge sources (GitHub + papers).

```bash
uv run osa sync all
```

### `osa sync status`

Show sync status and database statistics.

```bash
uv run osa sync status
```

Example output:

```
Knowledge Database Status
─────────────────────────
Database: ~/.local/share/osa/knowledge/hed.db

GitHub Items:
  hed-standard/hed-specification: 45 issues, 23 PRs
  hed-standard/hed-schemas: 12 issues, 8 PRs
  hed-standard/hed-javascript: 18 issues, 5 PRs
  Last sync: 2026-01-12 02:00:00 UTC

Papers:
  OpenALEX: 42 papers
  Semantic Scholar: 38 papers
  PubMed: 25 papers
  Last sync: 2026-01-05 03:00:00 UTC
```

### `osa sync search`

Search the knowledge database (for testing).

```bash
uv run osa sync search "validation error"
```

## Automated Sync (Docker)

When running OSA in Docker, the scheduler automatically syncs knowledge sources:

| Source | Default Schedule | Environment Variable |
|--------|------------------|----------------------|
| GitHub | Daily at 2am UTC | `SYNC_GITHUB_CRON` |
| Papers | Weekly, Sunday at 3am UTC | `SYNC_PAPERS_CRON` |

### Configuration

Configure via environment variables in your `.env` file. Keep comments on their own lines, because some `.env` parsers treat a trailing `# ...` as part of the value:

```bash
# Enable/disable automated sync
SYNC_ENABLED=true

# Sync schedules (cron expressions, UTC timezone)
# GitHub: daily at 2am
SYNC_GITHUB_CRON=0 2 * * *
# Papers: weekly, Sunday at 3am
SYNC_PAPERS_CRON=0 3 * * 0

# Optional API keys for higher rate limits
GITHUB_TOKEN=ghp_...
SEMANTIC_SCHOLAR_API_KEY=...
PUBMED_API_KEY=...
```

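The schedules use standard five-field cron syntax (minute, hour, day of month, month, day of week). As a reading aid, the sketch below splits an expression into named fields. The helper `split_cron` is hypothetical and is not part of OSA; it performs no validation beyond counting fields and says nothing about which scheduler library OSA uses.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expression: str) -> dict[str, str]:
    """Map each of the five whitespace-separated cron fields to its name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))


github = split_cron("0 2 * * *")   # default SYNC_GITHUB_CRON
papers = split_cron("0 3 * * 0")   # default SYNC_PAPERS_CRON
print(github)
print(papers)
```

In the papers schedule, the day-of-week field `0` means Sunday, so the job runs once a week; the GitHub schedule leaves that field as `*` and therefore runs every day.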
### Docker Compose

The included `docker-compose.yml` mounts a volume for database persistence:

```yaml
services:
  osa:
    volumes:
      - osa-data:/app/data

volumes:
  osa-data:
```

This ensures the knowledge database persists across container restarts.

## Manual Sync Trigger

You can manually trigger sync at any time:

```bash
# In a running Docker container (assumes the container is named "osa")
docker exec osa uv run osa sync all

# Or directly on the host with the CLI
uv run osa sync all
```

## Database Location

| Environment | Location |
|-------------|----------|
| Local (macOS) | `~/Library/Application Support/osa/knowledge/hed.db` |
| Local (Linux) | `~/.local/share/osa/knowledge/hed.db` |
| Docker | `/app/data/knowledge/hed.db` |

The location can be overridden with the `DATA_DIR` environment variable.

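The lookup order in the table can be sketched as a small function. This is an illustration only: `resolve_db_path` is a hypothetical helper, and the assumption that `DATA_DIR` is the directory containing `knowledge/hed.db` is inferred from the Docker row (`/app/data/knowledge/hed.db`), not confirmed by the OSA source.

```python
from pathlib import PurePosixPath


def resolve_db_path(env: dict[str, str], platform: str, home: str) -> PurePosixPath:
    """Pick the knowledge DB path; DATA_DIR takes precedence over platform defaults."""
    # Assumption: DATA_DIR is the parent of knowledge/hed.db (inferred, not confirmed).
    if env.get("DATA_DIR"):
        base = PurePosixPath(env["DATA_DIR"])
    elif platform == "darwin":
        base = PurePosixPath(home) / "Library" / "Application Support" / "osa"
    else:
        base = PurePosixPath(home) / ".local" / "share" / "osa"
    return base / "knowledge" / "hed.db"


print(resolve_db_path({"DATA_DIR": "/app/data"}, "linux", "/root"))
print(resolve_db_path({}, "darwin", "/Users/alice"))
print(resolve_db_path({}, "linux", "/home/alice"))
```

Passing the environment, platform, and home directory as arguments keeps the function easy to check; in real use they would come from `os.environ`, `sys.platform`, and `Path.home()`.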
## API Keys

All API keys are optional but recommended for higher rate limits:

| API Key | Purpose | Get Key |
|---------|---------|---------|
| `GITHUB_TOKEN` | GitHub API (issues/PRs) | [GitHub Settings](https://github.com/settings/tokens) |
| `SEMANTIC_SCHOLAR_API_KEY` | Semantic Scholar API | [S2 API](https://www.semanticscholar.org/product/api) |
| `PUBMED_API_KEY` | PubMed/NCBI API | [NCBI Settings](https://www.ncbi.nlm.nih.gov/account/settings/) |

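Because every key is optional, a sync can run with none of them set and quietly fall back to the lower unauthenticated limits. A quick pre-flight check such as the following sketch can show which keys are missing. The helper `missing_keys` is hypothetical and not part of the OSA CLI; only the variable names come from the table above.

```python
OPTIONAL_KEYS = {
    "GITHUB_TOKEN": "GitHub API (issues/PRs)",
    "SEMANTIC_SCHOLAR_API_KEY": "Semantic Scholar API",
    "PUBMED_API_KEY": "PubMed/NCBI API",
}


def missing_keys(env: dict[str, str]) -> list[str]:
    """Return the optional keys that are unset or blank, in table order."""
    return [name for name in OPTIONAL_KEYS if not env.get(name, "").strip()]


# Example with a literal mapping; pass dict(os.environ) to check the real environment.
example_env = {"GITHUB_TOKEN": "placeholder-token", "PUBMED_API_KEY": "  "}
for name in missing_keys(example_env):
    print(f"{name} is not set; {OPTIONAL_KEYS[name]} will use unauthenticated limits")
```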
## Agent Tools

The HED assistant has access to knowledge discovery tools:

### `search_hed_discussions`

Search GitHub issues and PRs for related discussions.

```
"Can you find any discussions about validation errors?"
→ "There's a related discussion in hed-specification#123: [link]"
```

### `search_hed_papers`

Search academic papers related to HED.

```
"Are there papers about HED in neuroimaging?"
→ "I found a relevant paper: 'HED Annotation Best Practices' [link]"
```

## Troubleshooting

### Sync fails with "gh: command not found"

The `gh` CLI is required for GitHub sync. Install it:

```bash
# macOS
brew install gh

# Ubuntu/Debian (older releases may need GitHub's apt repository added first)
sudo apt install gh
```

### Rate limiting

If you hit rate limits, configure API keys in your `.env` file. Without keys:

- GitHub: 60 requests/hour
- Semantic Scholar: ~100 requests/5 minutes
- PubMed: 3 requests/second

With keys, limits are significantly higher: for example, authenticated GitHub requests are allowed 5,000 per hour, and NCBI allows 10 requests/second with an API key.

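For scripts that call these APIs directly, a client-side throttle helps you stay under a per-second limit such as PubMed's 3 requests/second. The sketch below enforces a minimum interval between calls. The `Throttle` class is hypothetical and is not how OSA itself is known to pace requests; the injectable `clock` and `sleep` parameters exist only to make it easy to test.

```python
import time
from typing import Callable


class Throttle:
    """Enforce a minimum interval between calls (e.g. 3 requests/second)."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0  # earliest clock time at which a call may proceed

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self.clock()
        delay = max(0.0, self.next_allowed - now)
        if delay:
            self.sleep(delay)
        # Reserve the next slot one interval after this call's slot.
        self.next_allowed = max(now, self.next_allowed) + self.interval
        return delay


# Unauthenticated PubMed limit from the list above.
pubmed_throttle = Throttle(max_per_second=3)
```

Call `pubmed_throttle.wait()` immediately before each request; the first call returns at once and later calls sleep only as long as needed.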
### Database corruption

If the database becomes corrupted, delete and reinitialize:

```bash
# Linux default path; on macOS, in Docker, or with DATA_DIR set,
# use the path listed under "Database Location"
rm ~/.local/share/osa/knowledge/hed.db
uv run osa sync init
uv run osa sync all
```
