akreit/http-analytics-server

A sample implementation of a functional, async, configurable & high-throughput HTTP server optimized for handling clickstream / web-traffic data

About

This repo contains a runnable prototype of an HTTP server that provides a REST API for website traffic analytics. While the server can be run and tested (see instructions below), it is intended as a basis for discussion for a future production version.

Repository Structure

The repository consists of 4 modules:

flowchart LR
    model["<b>model</b><br>Data types & API models"]
    persistence["<b>persistence</b><br>Database access & queries"]
    server["<b>server</b><br>HTTP endpoints & business logic"]
    utils["<b>utils</b><br>Helpers & shared utilities"]

Further details on requirements & design can be found below.

Requirements

This section contains a short summary of the functional and non-functional requirements for the prototype.

Functional

  • provide a runnable HTTP server with a REST API
    • should run locally to support further design discussions & development activities
  • the API should accept POST and GET requests with the following query parameters:
POST /analytics?timestamp={millis_since_epoch}&user={user_id}&event={click|impression}
GET /analytics?timestamp={millis_since_epoch}
  • timestamps are in milliseconds since the Unix epoch (GMT/UTC)
  • endpoints should return the following responses:
    • POST: 204 on success, 400 on invalid input, 500 if anything else goes wrong; empty response body
    • GET: 200 on success, 400 on invalid input, 500 if anything else goes wrong; JSON response body
      • the JSON should contain the result of aggregating all events received in the hour corresponding to the provided timestamp
      • count unique users, and sum click and impression events respectively
      • a sample JSON response is shown below:
{
  "unique_users": 348,
  "clicks": 1234,
  "impressions": 5678
}
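
To make the contract concrete, here is a minimal sketch of how the two endpoints could be described with tapir (the stack chosen in the design section below). AnalyticsSummary is a hypothetical model mirroring the sample JSON, error outputs (400/500) are omitted for brevity, and the actual definitions in the model and server modules may differ:

import sttp.model.StatusCode
import sttp.tapir._
import sttp.tapir.generic.auto._
import sttp.tapir.json.circe._
import io.circe.generic.auto._

// Hypothetical response model mirroring the sample JSON above
final case class AnalyticsSummary(unique_users: Long, clicks: Long, impressions: Long)

// POST /analytics?timestamp=...&user=...&event=... --> 204 on success, empty body
val postAnalytics: PublicEndpoint[(Long, String, String), Unit, Unit, Any] =
  endpoint.post
    .in("analytics")
    .in(query[Long]("timestamp"))
    .in(query[String]("user"))
    .in(query[String]("event")) // 'click' or 'impression'; validation omitted here
    .out(statusCode(StatusCode.NoContent))

// GET /analytics?timestamp=... --> 200 with the JSON aggregate
val getAnalytics: PublicEndpoint[Long, Unit, AnalyticsSummary, Any] =
  endpoint.get
    .in("analytics")
    .in(query[Long]("timestamp"))
    .out(jsonBody[AnalyticsSummary])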

Non-functional

  • handle 'time-skewed' traffic pattern efficiently
    • we expect 95% of requests (POST & GET) to refer to the current hour, 5% to past hours
      • queries related to past hours should be handled efficiently, but may be slower than current-hour queries (up to 30% slower is admissible)
  • allow efficient aggregation of metrics per hour
    • GET requests return hourly aggregates, P99 should remain below 200ms for current hour, below 300ms for past hours
  • support high throughput of incoming requests
    • once the service is in production, we expect up to 100 POST requests per second and up to 20 GET requests per second
    • the design should allow for easy scaling in the future
  • failure tolerance:
    • emphasize type safety where possible, catch potential errors at compile time
    • server restarts/crashes should not lead to data loss
    • in a first version, it is ok for in-flight requests to be lost during a crash/restart
  • implementation in Scala, using the Typelevel stack where possible (according to team skills & experience)

Assumptions

  • single records are never updated
  • timestamps of incoming POST requests do not always increase monotonically and can refer to past hours (about 5% of requests)
    • within the active hour, we assume incoming requests to be mostly, but not strictly, ordered
    • we assume the volume of POST requests referring to past hours falls off non-linearly the further back the referenced hour lies

Design

This section contains a design overview, including components, tech stack and rejected design approaches.

Overview

The high-level design of the system is as follows:

flowchart TD
    User["User"]
    Server["HTTP Server"]
    DB["Database"]

    User <-->|POST /analytics| Server
    User <-->|GET /analytics| Server
    Server <-->|Read/Write| DB
  • User: represents the entity sending requests to the server; this can also be another application, etc.
  • Server: the main component of this project, handles incoming requests and interacts with the database to store and retrieve analytics data.
  • Database: used to persist analytics data, ensuring durability and supporting efficient querying.

Design considerations

HTTP Server

  • 🚀 scalability
    • to address high throughput requirements, use async, non-blocking HTTP server framework
    • further scaling can be achieved horizontally --> multiple (containerized) instances behind a load balancer
  • ⏩ streaming / chunking
    • client <--> server: clients can choose to stream the server's responses if required (up to the HTTP client implementation)
    • server <--> database:
      • each request's payload (both POST and GET) is small, so streaming requests / responses offers little benefit in this scenario
      • responses are small as well; the "heavy lifting" is done on the DB side
      • therefore, no streaming is required between server and database; focus on async / concurrency for high throughput instead
  • 📦 batching requests
    • to save network round trips, records could be batched on the server before being inserted into the database
    • trade-off: less durability (if buffered in memory) or higher complexity (e.g. using Redis) vs. potentially higher throughput and less write contention on the DB
    • the proposed design allows for such an extension; basic batching of DB inserts is partially implemented, but not yet complete (see the sketch after this list)
  • 🧰 tool stack
    • avoid unnecessary complexity, stay within a single tool stack
    • robust, type-safe implementation
  • 💥 failure handling / robustness
    • avoid data loss in case of server crashes / restarts --> use persistent storage
    • type-safe implementation to catch potential errors at compile time; some "unsafety" is still to be eliminated from the first implementation
    • error handling for invalid requests, unexpected failures etc.
      • logging of errors for further analysis and basic response code handling implemented
      • retries for transient failures (e.g. DB connection issues) not yet implemented
      • more robust error handling to be added in future iterations
  • 🚦 testing
    • allow for fast iterations during development: unit tests --> integration tests --> end-to-end tests
    • unit tests for individual components using scalatest and mock http server
    • integration tests using testcontainers to run DB in container (not yet implemented)
    • end-to-end tests using docker-compose setup (implemented)
    • load testing, e.g. using a load test client firing requests (not yet implemented)
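
To sketch the batching idea mentioned above (assuming cats-effect 3 and fs2; AnalyticsEvent and insertBatch are hypothetical names, not the repo's actual API): POST handlers enqueue events, and a background stream flushes them to the database in batches bounded by size or latency. As noted, events still buffered in memory would be lost on a crash.

import cats.effect.IO
import cats.effect.std.Queue
import fs2.Stream
import scala.concurrent.duration._

final case class AnalyticsEvent(timestamp: Long, user: String, event: String)

// Placeholder for a batch insert, e.g. backed by doobie's Update#updateMany
def insertBatch(events: List[AnalyticsEvent]): IO[Unit] = IO.unit

// Drain the queue in batches of up to 500 events or 1 second of waiting,
// whichever comes first; run this stream in the background.
def batchWriter(queue: Queue[IO, AnalyticsEvent]): Stream[IO, Unit] =
  Stream
    .fromQueueUnterminated(queue)
    .groupWithin(500, 1.second)
    .evalMap(chunk => insertBatch(chunk.toList))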

Database

  • trade-offs:

    • distributed vs non-distributed
      • distributed: increased scalability but higher complexity --> not required for initial implementation
    • SQL / relational vs. NoSQL: choose a robust OSS relational DB for the initial implementation
      • data structures for requests / responses are known --> relational model fits well
      • no table joins required --> would also justify a document model
      • potentially high-volume, concurrent write requests, but append-only
        • no write conflicts, but consider load & write contention
        • concurrent read requests --> no particularly strong consistency requirements for reads, since this is an analytical use case
      • time-based aggregation required --> choose a DB with good support for time-series data (see the query sketch after this list)
  • further optimizations:

    • compression of past-hour chunks
      • delay compression so that late inserts into older chunks do not force decompression --> start with the default strategy, tune later
    • move old partitions / chunks to slower / object storage
      • not implemented in first version, can be added later if required
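
To make the hourly aggregation concrete, here is a doobie query sketch (assuming an events(ts BIGINT, user_id TEXT, event_type TEXT) hypertable storing epoch millis; all names are hypothetical, not the repo's actual schema):

import doobie._
import doobie.implicits._

final case class HourlyAggregate(uniqueUsers: Long, clicks: Long, impressions: Long)

// Aggregate all events in the hour containing the given epoch-millis timestamp.
// Computing the hour bounds up front keeps the WHERE clause a plain range scan.
def hourlyAggregate(millis: Long): Query0[HourlyAggregate] = {
  val hourStart = millis - (millis % 3600000L) // truncate to the full hour (UTC)
  val hourEnd   = hourStart + 3600000L
  sql"""
    SELECT COUNT(DISTINCT user_id),
           COUNT(*) FILTER (WHERE event_type = 'click'),
           COUNT(*) FILTER (WHERE event_type = 'impression')
    FROM events
    WHERE ts >= $hourStart AND ts < $hourEnd
  """.query[HourlyAggregate]
}

// usage: hourlyAggregate(millis).unique.transact(xa)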

Derived design decisions

  • ✅ decision 1: use high-throughput async HTTP server framework --> tapir with http4s backend
  • ✅ decision 2: use non-distributed relational database with good support for time-series data --> PostgreSQL with TimescaleDB extension
  • ✅ decision 3: use easy-to-integrate database access library with good type safety --> doobie

Rejected design approaches

  • use in-memory storage instead of persistence --> lack of robustness
  • use intermediate caching (e.g. redis) for speed & ordering --> increased complexity, possible extension

Running & Testing

Pre-requisites

  • JDK 21 or higher (JDK 11+ should work, but testing has so far only been done on JDK 21)
  • Local docker installation (server + db are run using docker-compose for testing)

sbt

To build and run the HTTP server, use the project's build tool sbt: simply execute run from the sbt shell. See the testing section below for how to interact with the server once it's running.

docker

Assuming you have the pre-requisites installed, you can run the server and database using docker-compose from the project's root directory:

docker-compose up --build

Once the services are up, the HTTP server is accessible at http://localhost:8080 (assuming you are running it locally). You can test the HTTP endpoints using curl, e.g. for GET (the timestamp parameter is in milliseconds since epoch, as specified above):

curl "http://localhost:8080/analytics?timestamp=1761504097000"
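
A POST request can be tested analogously (the user and event values below are arbitrary examples); on success, the server responds with 204 and an empty body:

curl -i -X POST "http://localhost:8080/analytics?timestamp=1761504097000&user=user1&event=click"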

You can also use the server's Swagger UI for testing the endpoints, accessible at http://localhost:8080/docs.

Future Work

  • remove older chunks from the DB and offload them, e.g. to an analytical DB / data lake, for long-term storage & analysis
  • address authentication / authorization for productive use
  • For DB connections, a connection pool should be used to improve performance under load (already available in doobie, to be wired into the HTTP server; see https://github.com/typelevel/doobie/blob/main/modules/example/src/main/scala/example/HikariExample.scala and the sketch at the end of this section)
  • Initial table creation is done using an init script executed during docker-compose up; for productive use, a proper migration strategy should be implemented
  • Implement a load testing setup to validate performance requirements
  • Further performance optimizations can be explored, e.g. batching of incoming requests, tuning of DB indices etc.
  • Apart from local deployment, a production-ready deployment approach needs to be defined e.g. for K8s
  • Depending on deployment & target environment, an observability strategy needs to be defined
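
For reference, wiring a pooled transactor along the lines of the linked doobie example could look as follows (a sketch; connection parameters are placeholders, not the project's actual config):

import cats.effect.{IO, Resource}
import doobie.hikari.HikariTransactor
import doobie.util.ExecutionContexts

// Pooled transactor as a Resource; releasing it shuts the pool down cleanly.
val pooledTransactor: Resource[IO, HikariTransactor[IO]] =
  for {
    ce <- ExecutionContexts.fixedThreadPool[IO](32) // EC for awaiting connections
    xa <- HikariTransactor.newHikariTransactor[IO](
            "org.postgresql.Driver",                      // JDBC driver
            "jdbc:postgresql://localhost:5432/analytics", // placeholder URL
            "postgres",                                   // placeholder user
            "postgres",                                   // placeholder password
            ce
          )
  } yield xa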
