Reference C++ pipeline for multi-camera face recognition on NVIDIA DeepStream + TensorRT, with FAISS GPU vector search.
Multi-camera face recognition is no longer a model problem. Modern detectors (SCRFD, YOLOFace, RetinaFace) and embedders (ArcFace and friends) have been mature for years. The hard, unglamorous engineering lives between the model and the wire:
- Pipeline plumbing. Keeping a GStreamer / DeepStream pipeline alive while sources connect, disconnect, and stall under load. Backpressure, graceful source removal, batched-push timing, EOS handling.
- Batched, asynchronous inference. Detection runs per stream, but the encoder wants its 32+ aligned crops in one TensorRT call to amortise the context switch. That requires a probe that collects detections across streams within a single frame, performs alignment, and pushes a single batch to the encoder.
- Vector search at scale. FAISS IVF-Flat is excellent up to roughly 100K enrolled identities; beyond that you need IVF-PQ or sharding, and the cost of getting nlist, nprobe, and the metric wrong shows up as silent recall regressions. The index has to live on the GPU for search but be copied back to the CPU to be persisted.
- Decision logic. Top-1 similarity alone is a recipe for false positives in dense enrollments. A margin against top-2 catches most of them; a per-track confirmation layer catches the rest.
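
The decision-logic bullet above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: `Hit`, `MatchGate`, `decide`, and `TrackConfirmer` are hypothetical names, the default values are placeholders rather than tuned settings, and measuring the margin against the best hit of a *different* identity (rather than the literal second hit) is an assumption that matters only when an identity has several enrolled embeddings.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// One search hit: enrolled identity id and cosine similarity to the query.
struct Hit {
    std::int64_t id;
    float similarity;
};

// Hypothetical gate parameters; values are placeholders, not tuned defaults.
struct MatchGate {
    float threshold = 0.45f;   // minimum top-1 similarity
    float margin_min = 0.05f;  // minimum gap to the best competing identity
};

// Returns the accepted identity, or std::nullopt when the match is weak or
// ambiguous. `hits` must be sorted by descending similarity.
inline std::optional<std::int64_t> decide(const std::vector<Hit>& hits,
                                          const MatchGate& gate) {
    assert(std::is_sorted(hits.begin(), hits.end(),
                          [](const Hit& a, const Hit& b) {
                              return a.similarity > b.similarity;
                          }));
    if (hits.empty() || hits.front().similarity < gate.threshold) {
        return std::nullopt;
    }
    const Hit& top = hits.front();
    // Assumption: the runner-up is the best hit belonging to another identity.
    auto runner_up = std::find_if(hits.begin() + 1, hits.end(),
                                  [&](const Hit& h) { return h.id != top.id; });
    if (runner_up != hits.end() &&
        top.similarity - runner_up->similarity < gate.margin_min) {
        return std::nullopt;
    }
    return top.id;
}

// Per-track confirmation: report an identity only after it has been decided
// on `required` consecutive frames of the same track.
class TrackConfirmer {
public:
    explicit TrackConfirmer(int required) : required_(required) {}

    std::optional<std::int64_t> update(std::optional<std::int64_t> decision) {
        if (!decision) {
            current_.reset();
            streak_ = 0;
            return std::nullopt;
        }
        if (current_ == decision) {
            ++streak_;
        } else {
            current_ = decision;
            streak_ = 1;
        }
        if (streak_ >= required_) {
            return current_;
        }
        return std::nullopt;
    }

private:
    int required_;
    int streak_ = 0;
    std::optional<std::int64_t> current_;
};
```

One `TrackConfirmer` per tracked face would be fed the output of `decide` on every frame; a single ambiguous frame resets the streak, which trades a little latency for fewer spurious reports.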
This repository is a clean-room reference implementation of those patterns. It is not a fork of any production system; the code is original and intentionally focused on showcasing structure and engineering decisions rather than every feature you would ship.
- C++17 + CMake build targeting DeepStream 7.x / 8.x and TensorRT 8.6+
- SCRFD detector wrapper (three-stride decoder, NMS, letterboxed input)
- ArcFace ResNet50 embedder with batched TensorRT inference
- 5-point face aligner using a Umeyama similarity transform
- FAISS GPU index with adaptive IVF-Flat / IVF-PQ selection
- Probe chain that batches detections across streams before encoding
- Multi-source DeepStream pipeline with thread-safe add_source/remove_source
- Two CLI tools: face_enroll (build an index from a public dataset) and face_benchmark (per-stage latency / throughput)
- Docker + docker-compose for reproducible runs
- Structured JSON logging (spdlog)
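
As an illustration of the 5-point alignment step listed above, here is a dependency-free sketch of fitting a 2D similarity transform from detected landmarks to a reference template. It uses the closed-form complex least-squares fit, which should coincide with the Umeyama estimate in the non-degenerate, reflection-free case; it is not the repository's aligner. `Point2f`, `Similarity2D`, and `fit_similarity` are hypothetical names, and the 112x112 template coordinates are the values commonly used with InsightFace ArcFace models, assumed here rather than taken from this codebase.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

struct Point2f {
    float x, y;
};

// Similarity transform [ a -b tx ; b a ty ] (rotation + uniform scale + shift).
struct Similarity2D {
    double a, b, tx, ty;

    Point2f apply(Point2f p) const {
        return {static_cast<float>(a * p.x - b * p.y + tx),
                static_cast<float>(b * p.x + a * p.y + ty)};
    }
};

// Assumed reference landmarks for a 112x112 crop (eyes, nose, mouth corners).
constexpr std::array<Point2f, 5> kArcFaceTemplate112 = {{
    {38.2946f, 51.6963f},
    {73.5318f, 51.5014f},
    {56.0252f, 71.7366f},
    {41.5493f, 92.3655f},
    {70.7299f, 92.2041f},
}};

// Least-squares similarity mapping `src` landmarks onto `dst` landmarks.
inline Similarity2D fit_similarity(const std::array<Point2f, 5>& src,
                                   const std::array<Point2f, 5>& dst) {
    double msx = 0, msy = 0, mdx = 0, mdy = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        msx += src[i].x; msy += src[i].y;
        mdx += dst[i].x; mdy += dst[i].y;
    }
    const double n = static_cast<double>(src.size());
    msx /= n; msy /= n; mdx /= n; mdy /= n;

    // With points as complex numbers, z = sum(conj(p) * q) / sum(|p|^2).
    double num_a = 0, num_b = 0, den = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        const double px = src[i].x - msx, py = src[i].y - msy;
        const double qx = dst[i].x - mdx, qy = dst[i].y - mdy;
        num_a += px * qx + py * qy;
        num_b += px * qy - py * qx;
        den += px * px + py * py;
    }
    assert(den > 0.0 && "source landmarks must not all coincide");

    Similarity2D m{};
    m.a = num_a / den;
    m.b = num_b / den;
    // Translation maps the source centroid onto the destination centroid.
    m.tx = mdx - (m.a * msx - m.b * msy);
    m.ty = mdy - (m.b * msx + m.a * msy);
    return m;
}
```

The resulting 2x3 matrix is what an affine warp (for example OpenCV's CUDA warp, given the `cudawarping` requirement) would consume to produce the aligned crop.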
                ┌──────────┐    ┌──────────┐    ┌────────────┐    ┌─────────┐
RTSP / file ──► │ uridecode│ ──►│ nvstream │ ──►│  nvinfer   │ ──►│ appsink │
                │   bin    │    │   mux    │    │  (SCRFD)   │    │         │
                └──────────┘    └──────────┘    └─────┬──────┘    └─────────┘
                                                      │
                                     src-pad probe (parses tensor meta)
                                                      ▼
                ┌────────────────────────────────────────────────────┐
                │                     ProbeChain                     │
                │  ┌──────────┐   ┌──────────┐   ┌────────────────┐  │
                │  │  align   │──►│  encode  │──►│ FAISS search + │  │
                │  │ (5-point)│   │ (ArcFace)│   │ margin filter  │  │
                │  └──────────┘   └──────────┘   └────────────────┘  │
                └────────────────────────────────────────────────────┘
                                           │
                                           ▼
                                 FrameResult callback
                                (logging / Redis / DB)
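
The hand-off from the probe to the encoder in the diagram above can be sketched as follows. This is a simplified illustration of collecting detections from every stream in one muxed batch and cutting them into encoder-sized chunks; `FaceCrop` and `CrossStreamBatcher` are hypothetical names, pixel data is elided, and the real probe chain may well be organised differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Back-reference from an encoder batch slot to the detection it came from.
// Aligned pixel data is elided in this sketch.
struct FaceCrop {
    int source_id;
    int det_index;
};

// Collects faces from every stream in one muxed batch, then hands them to the
// encoder in chunks no larger than the engine's maximum batch size.
class CrossStreamBatcher {
public:
    explicit CrossStreamBatcher(std::size_t max_batch) : max_batch_(max_batch) {
        assert(max_batch > 0);
    }

    // Called once per stream present in the current batched buffer.
    void add(int source_id, int num_faces) {
        for (int i = 0; i < num_faces; ++i) {
            pending_.push_back(FaceCrop{source_id, i});
        }
    }

    // Drains everything collected so far; each chunk maps to one encoder call.
    std::vector<std::vector<FaceCrop>> flush() {
        std::vector<std::vector<FaceCrop>> chunks;
        for (std::size_t begin = 0; begin < pending_.size(); begin += max_batch_) {
            const std::size_t end = std::min(begin + max_batch_, pending_.size());
            chunks.emplace_back(
                pending_.begin() + static_cast<std::ptrdiff_t>(begin),
                pending_.begin() + static_cast<std::ptrdiff_t>(end));
        }
        pending_.clear();
        return chunks;
    }

private:
    std::size_t max_batch_;
    std::vector<FaceCrop> pending_;
};
```

Keeping `source_id` and `det_index` next to each slot is what lets the embeddings returned by the encoder be scattered back to the right stream and detection before the search stage.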
Indicative numbers on synthetic 720p streams, RTX 3090, batch_size=8. Real numbers depend heavily on input resolution, face count per frame, and chosen index parameters; treat these as a sanity floor, not a benchmark.
| Stage | p50 latency | Throughput |
|---|---|---|
| SCRFD detection (per frame) | ~7 ms | ~140 FPS @ batch 8 |
| ArcFace encoding (per face) | ~2 ms | ~500 faces/s @ batch 32 |
| FAISS search, n=10K, top-5 | ~0.3 ms | ~3K queries/s |
| FAISS search, n=100K, top-5 | ~0.9 ms | ~1K queries/s |
tools/face_benchmark regenerates the FAISS numbers locally; the rest
require TensorRT engines built from the public InsightFace ONNX
checkpoints (see scripts/download_models.sh).
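
For readers who want to reproduce p50 figures like the ones in the table, here is a minimal host-side timing sketch. It is not the code in face_benchmark; `percentile` and `time_ms` are hypothetical helpers, and the percentile uses a simple lower-rank definition. For GPU stages the timed callable would need to synchronise the CUDA stream before returning, otherwise only the launch cost is measured.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Lower-rank percentile: element at index floor(q * (n - 1)) of the sorted
// samples. Takes the vector by value because nth_element reorders it.
inline double percentile(std::vector<double> samples, double q) {
    assert(!samples.empty() && q >= 0.0 && q <= 1.0);
    const std::size_t k =
        static_cast<std::size_t>(q * static_cast<double>(samples.size() - 1));
    std::nth_element(samples.begin(),
                     samples.begin() + static_cast<std::ptrdiff_t>(k),
                     samples.end());
    return samples[k];
}

// Runs `fn` a few times untimed, then records wall-clock milliseconds per call.
template <typename Fn>
std::vector<double> time_ms(Fn&& fn, int warmup, int iters) {
    for (int i = 0; i < warmup; ++i) {
        fn();
    }
    std::vector<double> out;
    out.reserve(static_cast<std::size_t>(iters));
    for (int i = 0; i < iters; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        out.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return out;
}
```

Throughput then follows from items processed per call divided by the per-call latency, which is how a per-frame latency of about 7 ms corresponds to roughly 140 FPS.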
# 1. Build
cmake -S . -B build -G Ninja \
-DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# 2. Get the public ONNX checkpoints (insightface buffalo_l)
./scripts/download_models.sh
# 3. Compile TensorRT engines
./scripts/build_engines.sh
# 4. (Optional) generate four synthetic test streams
./scripts/generate_test_streams.sh 4
# 5. Run
./build/face_server --config configs/system_config.yaml \
    --pgie configs/pgie_scrfd.txt

Or, with Docker:

docker compose up --build
├── CMakeLists.txt
├── cmake/ # Find* modules + compiler settings
├── configs/ # YAML system config + pgie_scrfd.txt
├── docker/ # Dockerfile + entrypoint
├── docker-compose.yml
├── include/face_pipeline/ # Public headers
│ ├── align/ # Face aligner
│ ├── config/ # System config types
│ ├── indexing/ # FAISS searcher
│ ├── pipeline/ # DeepStream pipeline + probe chain
│ ├── trt/ # TRT engine, SCRFD, ArcFace
│ └── utils/ # Logger, CUDA helpers
├── src/ # Implementation files (mirrors include/)
├── tools/ # face_enroll.cpp, benchmark.cpp
├── scripts/ # download_models.sh, build_engines.sh, ...
└── docs/
You will need:
- CMake 3.22+
- A C++17 compiler (GCC 11+ recommended)
- CUDA Toolkit 12.x
- TensorRT 8.6+ (matching your CUDA version)
- DeepStream SDK 7.x or 8.x
- OpenCV 4.5+ with CUDA modules (cudaimgproc, cudawarping)
- Eigen 3.4+
- spdlog, yaml-cpp, FAISS (GPU build)
- gstreamer-1.0 development headers
The Docker build (docker/Dockerfile) uses the official NVIDIA DeepStream
devel image as the toolchain and is the most portable way to build.
configs/system_config.yaml is the single source of truth for runtime
parameters. The most important sections:
- pipeline.batch_size: must match the engine's max batch and the batch-size field in pgie_scrfd.txt.
- detection.confidence_threshold, detection.nms_iou_threshold: affect recall and clutter; calibrate against your input distribution.
- faiss.index_type: ivf_flat or ivf_pq. ivf_pq is automatically chosen above faiss.ivf_pq_min_size enrollments.
- recognition.threshold, recognition.margin_min: the gate for reporting a positive match. Margin is more important than absolute threshold for dense enrollments.
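
To make the faiss.index_type switch concrete, here is a sketch of how an index plan might be derived from the enrollment count. `IndexPlan` and `plan_index` are hypothetical, and the sizing constants are assumptions borrowed from commonly cited FAISS rules of thumb (nlist around 4·sqrt(N), about 39 training vectors per cluster, a small nprobe fraction). The repository's own selection logic may differ, and treating "above" as strictly greater than faiss.ivf_pq_min_size is also an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>

// Hypothetical result type: which index to build and with what IVF settings.
struct IndexPlan {
    std::string index_type;  // "ivf_flat" or "ivf_pq"
    int nlist;               // number of coarse clusters
    int nprobe;              // clusters visited per query
};

inline IndexPlan plan_index(std::size_t num_enrolled,
                            std::size_t ivf_pq_min_size) {
    assert(num_enrolled > 0);
    IndexPlan plan;
    // Assumption: "above" means strictly greater than the configured size.
    plan.index_type = num_enrolled > ivf_pq_min_size ? "ivf_pq" : "ivf_flat";

    // Rule of thumb (assumption): nlist ~ 4 * sqrt(N), capped so that every
    // cluster still has roughly 39 vectors available for k-means training.
    const long by_sqrt =
        std::lround(4.0 * std::sqrt(static_cast<double>(num_enrolled)));
    const long by_training = static_cast<long>(num_enrolled / 39);
    plan.nlist = static_cast<int>(std::max(1L, std::min(by_sqrt, by_training)));

    // Assumption: probe about 1/16 of the clusters; higher values buy recall
    // at the cost of latency.
    plan.nprobe = std::max(1, plan.nlist / 16);
    return plan;
}
```

Whatever values are chosen, they are worth validating against a brute-force search on a held-out query set, since a poor nlist/nprobe pair degrades recall without raising any error.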
- Only the face track is implemented. Real deployments often need a second pass over body crops (re-ID) or multi-mode tracking; both are out of scope for this reference.
- No persistent enrollment storage. face_enroll writes a FAISS index file directly; pgvector / RDBMS integration is left as an exercise.
- No gRPC or REST surface in this reference. The DeepStreamPipeline exposes add_source/remove_source programmatically and is meant to be wrapped by whichever transport you prefer.
- Engines must be FP16/FP32; INT8 calibration scaffolding is sketched in scripts/build_engines.sh but not validated end-to-end.
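
Since add_source/remove_source are meant to be wrapped by a transport layer, here is one way the bookkeeping behind such a wrapper could look. `SourceRegistry` and its signatures are hypothetical and do not reflect the actual DeepStreamPipeline interface; the sketch only shows mutex-guarded slot assignment, with the GStreamer calls that would attach or detach the source bin left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <optional>
#include <set>
#include <string>

// Tracks which muxer slot each camera URI occupies. All methods may be called
// concurrently, e.g. from request handlers of a transport layer.
class SourceRegistry {
public:
    explicit SourceRegistry(int max_sources) {
        assert(max_sources > 0);
        for (int i = 0; i < max_sources; ++i) {
            free_.insert(i);
        }
    }

    // Returns the assigned slot, or std::nullopt if the URI is already active
    // or no slot is free. A real wrapper would attach the source bin here.
    std::optional<int> add_source(const std::string& uri) {
        std::lock_guard<std::mutex> lock(mu_);
        if (free_.empty() || slots_.count(uri) > 0) {
            return std::nullopt;
        }
        const int slot = *free_.begin();  // lowest free slot is reused first
        free_.erase(free_.begin());
        slots_[uri] = slot;
        return slot;
    }

    // Returns false if the URI was not active. A real wrapper would detach
    // and release the source bin before freeing the slot.
    bool remove_source(const std::string& uri) {
        std::lock_guard<std::mutex> lock(mu_);
        const auto it = slots_.find(uri);
        if (it == slots_.end()) {
            return false;
        }
        free_.insert(it->second);
        slots_.erase(it);
        return true;
    }

    std::size_t active() const {
        std::lock_guard<std::mutex> lock(mu_);
        return slots_.size();
    }

private:
    mutable std::mutex mu_;
    std::map<std::string, int> slots_;
    std::set<int> free_;
};
```

A gRPC or REST handler would call these methods and translate `std::nullopt` / `false` into the corresponding error response.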
- gRPC façade for camera management and identity enrollment
- PostgreSQL + pgvector store as a fallback / persistence layer
- NvDCF tracker integration with per-track recognition fusion
- INT8 calibration recipe with a reproducible calibration set
- GoogleTest suite for the algorithmic stages
MIT — see LICENSE.
This repository is a reference implementation of techniques and patterns I have used in production face-recognition systems. It uses public algorithms (SCRFD, ArcFace, FAISS) and synthetic / public datasets only; no proprietary code, configurations, or data are included.