This repo contains a runnable prototype of an HTTP server that provides a REST API used for website traffic analytics. While the server can be run and tested (see instructions below), it is intended as a basis for discussion ahead of a future production version.
The repository consists of 4 modules:
```mermaid
flowchart LR
    model["<b>model</b><br>Data types & API models"]
    persistence["<b>persistence</b><br>Database access & queries"]
    server["<b>server</b><br>HTTP endpoints & business logic"]
    utils["<b>utils</b><br>Helpers & shared utilities"]
```
Further details on requirements & design can be found below.
This section contains a short summary of the functional and non-functional requirements for the prototype.
- provide a runnable HTTP server with a REST API
- should run locally to support further design discussions & development activities
- the API should accept GET and POST requests with the following query parameters:

```
POST /analytics?timestamp={millis_since_epoch}&user={user_id}&event={click|impression}
GET  /analytics?timestamp={millis_since_epoch}
```
- timestamps are in milliseconds since epoch, GMT timezone
- endpoints should return the following responses:
- POST: 204 on success, 400 on invalid input, 500 if anything else goes wrong, empty response body
- GET: 200 on success, 400 on invalid input, 500 if anything else goes wrong, return JSON response body
- the JSON body should contain the result of aggregating all events received in the hour corresponding to the provided timestamp (see the hour-bucketing sketch after this list)
- count (unique) users, sum click & impression events respectively
- a sample JSON response is shown below:

```json
{
  "unique_users": 348,
  "clicks": 1234,
  "impressions": 5678
}
```

- handle 'time-skewed' traffic pattern efficiently
- we expect 95% of requests (`POST` & `GET`) to refer to the current hour, 5% to past hours
- queries related to past hours should be handled efficiently, but may be slower than current-hour queries (up to 30% slower is admissible)
- allow efficient aggregation of metrics per hour
- `GET` requests return hourly aggregates; P99 latency should remain below 200 ms for the current hour and below 300 ms for past hours
- support high throughput of incoming requests
- once the service is in production, we expect up to 100 `POST` requests per second and up to 20 `GET` requests per second
- the design should allow for easy scaling in the future
- failure tolerance:
- emphasize type safety where possible, catch potential errors at compile time
- server restarts/crashes should not lead to data loss
- in a first version, it is ok for in-flight requests to be lost during a crash/restart
- implementation in Scala, using the Typelevel stack where possible (according to team skills & experience)
- single records are never updated
- timestamps of incoming `POST` requests are not always monotonically increasing and can refer to past hours (only about ~5% of requests)
- within the active hour, we assume incoming requests to be mostly ordered, but not strictly
- we assume that the share of `POST` requests referring to past hours decreases in a non-linear fashion the further we go into the past
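To illustrate the hourly aggregation window referenced above, the sketch below shows one way to truncate a millisecond epoch timestamp to its UTC hour bucket; the helper name is hypothetical and not part of the prototype.

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

// Hypothetical helper: map a millisecond epoch timestamp to the start of its
// UTC hour, i.e. the bucket used for hourly aggregation of events.
def hourBucket(millisSinceEpoch: Long): Instant =
  Instant.ofEpochMilli(millisSinceEpoch).truncatedTo(ChronoUnit.HOURS)

// Example: 3,600,001 ms after the epoch falls into the bucket starting at 01:00 UTC:
// hourBucket(3600001L) == Instant.parse("1970-01-01T01:00:00Z")
```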
This section contains a design overview, including components, tech stack and rejected design approaches.
The high-level design of the system is as follows:
```mermaid
flowchart TD
    User["User"]
    Server["HTTP Server"]
    DB["Database"]
    User <-->|POST /analytics| Server
    User <-->|GET /analytics| Server
    Server <-->|Read/Write| DB
```
- User: represents the entity sending requests to the server; this can also be another application, etc.
- Server: the main component of this project, handles incoming requests and interacts with the database to store and retrieve analytics data.
- Database: used to persist analytics data, ensuring durability and supporting efficient querying.
- 🚀 scalability
- to address high-throughput requirements, use an async, non-blocking HTTP server framework
- further scaling can be achieved through horizontal scaling --> multiple containerized instances behind a load balancer
- ⏩ streaming / chunking
- client <--> server: clients can choose to stream HTTP server responses if required (up to the HTTP client implementation)
- server <--> database:
- each request's payload (both `POST` and `GET`) is small, so streaming requests / responses offers little benefit in this scenario
- responses are small as well; the "heavy lifting" is done on the DB side
- therefore, no streaming is required between server and database; focus on async / concurrency for high throughput instead
- 📦 batching requests
- to save network round trips, we could batch records to be inserted into the database on the server
- trade-off: less durability (if in-memory buffer) or higher complexity (e.g. use redis) vs potentially higher throughput, less write contention on DB
- the proposed design allows for such an extension and already includes some basic batching of DB inserts; however, it is not fully implemented yet (see the batching sketch after this list)
- 🧰 tool stack
- avoid unnecessary complexity, stay within single toolstack
- robust, type-safe implementation
- 💥 failure handling / robustness
- avoid data loss in case of server crashes / restarts --> use persistent storage
- type-safe implementation to catch potential errors at compile time; some "unsafety" still needs to be eliminated from the first implementation
- error handling for invalid requests, unexpected failures etc.
- logging of errors for further analysis and basic response code handling implemented
- retries for transient failures (e.g. DB connection issues) not yet implemented
- more robust error handling to be added in future iterations
- 🚦 testing
- allow for fast iterations during development: unit tests --> integration tests --> end-to-end tests
- unit tests for individual components using ScalaTest and a mock HTTP server
- integration tests using testcontainers to run DB in container (not yet implemented)
- end-to-end tests using docker-compose setup (implemented)
- load testing e.g. using load test client firing requests (not yet implemented)
- trade-offs:
- distributed vs non-distributed: distributed offers increased scalability but higher complexity --> not required for the initial implementation
- sql / relational vs no-sql: choose a robust OSS relational DB for the initial implementation
- data structure for request / response is known --> relational model fits well
- no table joins required --> would also justify a document model
- potentially high-volume, concurrent write requests, however append-only --> no write conflicts, but consider load & write contention
- concurrent read requests --> no particularly strong consistency requirements for reads, since this is an analytical use case
- time-based aggregation required --> choose a DB with good support for time-series data
- further optimizations:
- compaction for past hours: delay compaction to avoid decompression on older inserts --> start with the default strategy, tune this later
- move old partitions / chunks to slower / object storage
- not implemented in the first version, can be added later if required
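To make the batching idea discussed above more concrete, the following sketch shows how batched inserts might look with doobie's `Update.updateMany`; the record type, table and column names are assumptions, not the prototype's actual schema.

```scala
import cats.effect.IO
import doobie._
import doobie.implicits._

// Hypothetical event record and table layout, for illustration only.
final case class EventRecord(timestampMillis: Long, userId: String, eventType: String)

// Insert a whole batch of events in one round trip using doobie's batch update.
def insertBatch(xa: Transactor[IO], events: List[EventRecord]): IO[Int] =
  Update[(Long, String, String)](
    "insert into events (ts_millis, user_id, event_type) values (?, ?, ?)"
  ).updateMany(events.map(e => (e.timestampMillis, e.userId, e.eventType)))
   .transact(xa)
```

Whether such batching pays off depends on the write contention discussed above and should be validated with the load-testing setup mentioned under testing.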
- ✅ decision 1: use high-throughput async HTTP server framework --> tapir with http4s backend
- ✅ decision 2: use non-distributed relational database with good support for time-series data --> PostgreSQL with TimescaleDB extension
- ✅ decision 3: use easy-to-integrate database access library with good type safety --> doobie
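As an illustration of decisions 1 and 3, a minimal sketch of how the two endpoints might be described with tapir is shown below; the response model, field names and error mapping are assumptions rather than the prototype's actual code.

```scala
import sttp.model.StatusCode
import sttp.tapir._
import sttp.tapir.generic.auto._
import sttp.tapir.json.circe._
import io.circe.generic.auto._

// Hypothetical response model mirroring the sample JSON shown in the requirements.
final case class HourlyStats(unique_users: Long, clicks: Long, impressions: Long)

// POST /analytics?timestamp=...&user=...&event=... --> 204 on success, empty body
val postAnalytics =
  endpoint.post
    .in("analytics")
    .in(query[Long]("timestamp").and(query[String]("user")).and(query[String]("event")))
    .out(statusCode(StatusCode.NoContent))

// GET /analytics?timestamp=... --> 200 with the JSON aggregate for the matching hour
val getAnalytics =
  endpoint.get
    .in("analytics")
    .in(query[Long]("timestamp"))
    .out(jsonBody[HourlyStats])
```

Such endpoint descriptions can then be interpreted into http4s routes and into the OpenAPI documentation served by the Swagger UI mentioned in the testing instructions.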
- use in-memory storage instead of persistence --> lack of robustness
- use intermediate caching (e.g. redis) for speed & ordering --> increased complexity, possible extension
- JDK 21 or higher (JDK 11+ should work; however, testing has so far only been done with JDK 21)
- Local docker installation (server + db are run using docker-compose for testing)
To build and run the HTTP server, use the project's build tool, sbt. Simply execute `run` from the sbt shell.
See the testing section below on how to interact with the server once it's running.
Assuming you have the prerequisites installed, you can run the server and database using docker-compose from the project's root directory:

```
docker-compose up --build
```

Once the services are up, the HTTP server will be accessible at http://localhost:8080 (assuming you are running the server locally).
You can test the HTTP endpoints using curl, e.g. for GET:
```
curl http://127.0.0.1:8080/analytics/?millis_since_epoch=1761504097
```

You can also use the server's Swagger UI for testing the endpoints, accessible at http://localhost:8080/docs.
- remove older chunks from the DB and offload them, e.g. to an analytical DB / data lake, for long-term storage & analysis
- address authentication / authorization for productive use
- For DB connections, a connection pool should be used to improve performance under load (already available in doobie, still to be wired into the HTTP server; see https://github.com/typelevel/doobie/blob/main/modules/example/src/main/scala/example/HikariExample.scala and the pooled-transactor sketch after this list)
- Initial table creation is done using an init script executed during `docker-compose up`; for production use, a proper migration strategy should be implemented
- Implement a load-testing setup to validate the performance requirements
- Further performance optimizations can be explored, e.g. batching of incoming requests, tuning of DB indices etc.
- Apart from local deployment, a production-ready deployment approach needs to be defined e.g. for K8s
- Depending on deployment & target environment, an observability strategy needs to be defined
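For the connection-pool item above, a minimal sketch of a pooled doobie transactor (assuming doobie 1.x with cats-effect 3, following the linked Hikari example) could look as follows; driver, URL and credentials are placeholders, not the project's configuration.

```scala
import cats.effect.{IO, Resource}
import doobie.hikari.HikariTransactor
import doobie.util.ExecutionContexts

// Pooled transactor as a Resource: acquiring it sets up the Hikari pool,
// releasing it shuts the pool down. Connection details below are placeholders.
val pooledTransactor: Resource[IO, HikariTransactor[IO]] =
  for {
    ce <- ExecutionContexts.fixedThreadPool[IO](8) // threads awaiting JDBC calls
    xa <- HikariTransactor.newHikariTransactor[IO](
            "org.postgresql.Driver",
            "jdbc:postgresql://localhost:5432/analytics", // placeholder URL
            "postgres",                                    // placeholder user
            "postgres",                                    // placeholder password
            ce
          )
  } yield xa
```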