Simulation Research Driver Challenge

Summary

Design and implement a file driver that manages the sequential processing of market data. The system should intelligently find, iterate and process market updates in sequential order based on their arrival time.

Special consideration should be given to:

The memory usage of the program. The files are too large to be loaded into memory at once, so a low-memory solution is required.
The data comes as .parquet files, so processing should account for decoding each file as well as transforming each update into the desired type. The choice of type is left to the implementer.
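
As an illustration of the low-memory requirement, here is a minimal sketch of lazy row streaming. It assumes pyarrow is an acceptable parquet decoder (the challenge does not name one), and the helper names `iter_rows` and `iter_parquet_rows` are hypothetical. A batch holds several rows at once, so `batch_size=1` would be the strictest reading of the memory constraint, at a likely cost in throughput.

```python
from typing import Any, Iterable, Iterator, Mapping, Sequence


def iter_rows(
    batches: Iterable[Sequence[Mapping[str, Any]]],
) -> Iterator[Mapping[str, Any]]:
    """Flatten an iterable of row batches into a lazy stream of single rows."""
    for batch in batches:
        for row in batch:
            yield row


def iter_parquet_rows(
    path: str, columns: list[str], batch_size: int = 1024
) -> Iterator[Mapping[str, Any]]:
    """Lazily yield rows from one parquet file, decoding one batch at a time."""
    # Assumption: pyarrow is available. It is imported lazily so that
    # iter_rows stays usable (and testable) without it.
    import pyarrow.parquet as pq

    with open(path, "rb") as handle:
        parquet_file = pq.ParquetFile(handle)
        batches = (
            batch.to_pylist()  # list of {column_name: value} dicts
            for batch in parquet_file.iter_batches(
                batch_size=batch_size, columns=columns
            )
        )
        yield from iter_rows(batches)
```

Because both functions are generators, nothing is decoded until the consumer asks for the next row, and the file handle is closed once the stream is exhausted.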

There will be 3 different types of market data to process.

Tick data

This update is generated whenever the price changes. It reflects the state of the top of the book at the moment of the price change. The relevant data to process for this assignment is:

Name	ColumnName	Description
ts	exchangeTimestamp	timestamp of the market update (At the exchange)
rts	receiptTimestamp	timestamp of arrival of the market update (At your server)
bid_price	bid	bid price of the market update
ask_price	ask	ask price of the market update
bid_size	bidVolume	bid size of the market update
ask_size	askVolume	ask size of the market update
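
One possible target type for tick updates, sketched under assumptions: the `TickUpdate` dataclass and `parse_tick` helper are hypothetical names, raw rows are assumed to arrive as mappings keyed by the column names above, and timestamps are assumed to be integers (the challenge does not state their type or unit).

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class TickUpdate:
    ts: int  # exchange timestamp (type and unit are an assumption)
    rts: int  # receipt timestamp (type and unit are an assumption)
    bid_price: float
    ask_price: float
    bid_size: float
    ask_size: float


def parse_tick(row: Mapping[str, Any]) -> TickUpdate:
    """Map the raw parquet column names onto the names used by the strategy."""
    return TickUpdate(
        ts=row["exchangeTimestamp"],
        rts=row["receiptTimestamp"],
        bid_price=row["bid"],
        ask_price=row["ask"],
        bid_size=row["bidVolume"],
        ask_size=row["askVolume"],
    )
```

Columns that are not listed in the table are simply ignored, so the reader can request only these six columns when decoding.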

Trade data

This update is generated when a trade is executed. The update contains information about the executed trade, including price and quantity. The relevant data to process for this assignment is:

Name	ColumnName	Description
ts	exchangeTimestamp	timestamp of the trade execution (At the exchange)
rts	receiptTimestamp	timestamp of arrival of the trade update (At your server)
price	price	executed trade price
size	volume	executed trade quantity
side	isBuySide	side of the aggressor (buy/sell)

Note: isBuySide should be transformed to 1 if the aggressor is a buyer, and -1 otherwise.
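
The side transformation can be sketched as follows. This assumes isBuySide is stored as a boolean-like value; `TradeUpdate` and `parse_trade` are hypothetical names, not part of the challenge.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class TradeUpdate:
    ts: int  # exchange timestamp (type and unit are an assumption)
    rts: int  # receipt timestamp (type and unit are an assumption)
    price: float
    size: float
    side: int  # 1 if the aggressor bought, -1 otherwise


def parse_trade(row: Mapping[str, Any]) -> TradeUpdate:
    """Rename the raw columns and encode the aggressor side as 1 / -1."""
    return TradeUpdate(
        ts=row["exchangeTimestamp"],
        rts=row["receiptTimestamp"],
        price=row["price"],
        size=row["volume"],
        # Assumption: isBuySide is truthy for a buy aggressor.
        side=1 if row["isBuySide"] else -1,
    )
```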

Order Book data

This update provides a snapshot of the full order book state at a point in time. The update contains all price levels with their respective sizes on both bid and ask sides. The relevant data to extract and process for this assignment is:

Name	ColumnName	Description
ts	exchangeTimestamp	timestamp of the order book state (At the exchange).
rts	receiptTimestamp	timestamp of arrival of the snapshot (At your server)
bids	bids	array of [price, size] pairs for bid side
asks	asks	array of [price, size] pairs for ask side

Note: Each snapshot arrives as two separate order book updates, one per side of the book. You can assume the updates are symmetrical and that the ts and rts fields are the same on both sides of the book.
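
A rough sketch of combining the two halves into one snapshot. The raw layout is an assumption: each half is taken to be a mapping with both timestamps and with only one of bids/asks populated (the other missing, None, or empty). `BookSnapshot` and `merge_book_sides` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class BookSnapshot:
    ts: int
    rts: int
    bids: list  # list of (price, size) tuples
    asks: list  # list of (price, size) tuples


def merge_book_sides(
    first: Mapping[str, Any], second: Mapping[str, Any]
) -> BookSnapshot:
    """Combine two raw half-book updates that share ts and rts."""
    for column in ("exchangeTimestamp", "receiptTimestamp"):
        if first[column] != second[column]:
            raise ValueError(f"book halves disagree on {column}")
    # Assumption: each side is populated in exactly one of the two halves.
    bids = first.get("bids") or second.get("bids") or []
    asks = first.get("asks") or second.get("asks") or []
    return BookSnapshot(
        ts=first["exchangeTimestamp"],
        rts=first["receiptTimestamp"],
        bids=[tuple(level) for level in bids],
        asks=[tuple(level) for level in asks],
    )
```

Raising on mismatched timestamps makes a violated symmetry assumption visible immediately instead of silently producing a mixed snapshot.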

System Behavior

Your solution should demonstrate how you approach the following scenarios:

Handling the sequential processing of files, taking into account:
- There are several files per day for each datatype.
- Files are ordered by name, meaning 2024-07-07.binance.BTCUSDT.00.parquet precedes 2024-07-07.binance.BTCUSDT.01.parquet.
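
A small sketch of the file ordering, assuming every file name follows the pattern of the example above (date, exchange, symbol, numeric index, extension, separated by dots). `file_sort_key` and `order_files` are hypothetical helper names.

```python
from pathlib import Path
from typing import Iterable


def file_sort_key(path: Path) -> tuple[str, int]:
    """Sort key: ISO date first, then the numeric file index within the day."""
    # Assumption: names look like 2024-07-07.binance.BTCUSDT.00.parquet
    date, _exchange, _symbol, index, _extension = path.name.split(".")
    return date, int(index)


def order_files(paths: Iterable[Path]) -> list[Path]:
    """Return the files in the order they should be processed."""
    return sorted(paths, key=file_sort_key)
```

Parsing the index as an integer keeps the order correct even if an index is not zero-padded, which a plain lexicographic sort would not guarantee.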
You'll have a class called Strategy with the following methods:
- process_data()
  - process_tick_update(symbol_name, update)
  - process_trade_update(symbol_name, update)
  - process_book_snapshot_update(symbol_name, update)
Each update should be read, then parsed and transformed, then fed sequentially to the Strategy object.
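
One way to interleave the data types by arrival time is a k-way merge over lazy streams that keeps only the head update of each stream in memory. The sketch below is an assumption about how this could be done, not a required design; `merge_by_receipt_time`, `feed_strategy`, and the stream names "tick", "trade", and "book" are hypothetical.

```python
import heapq
from typing import Any, Callable, Iterable, Iterator

_DONE = object()  # sentinel marking an exhausted stream


def merge_by_receipt_time(
    streams: dict[str, Iterable[Any]], key: Callable[[Any], Any]
) -> Iterator[tuple[str, Any]]:
    """Yield (stream_name, update) pairs in ascending key order."""
    iterators = [(name, iter(stream)) for name, stream in streams.items()]
    heap = []
    for index, (_name, iterator) in enumerate(iterators):
        update = next(iterator, _DONE)
        if update is not _DONE:
            # index breaks ties, so the updates themselves are never compared
            heap.append((key(update), index, update))
    heapq.heapify(heap)
    while heap:
        _, index, update = heap[0]
        name, iterator = iterators[index]
        yield name, update
        following = next(iterator, _DONE)
        if following is _DONE:
            heapq.heappop(heap)
        else:
            heapq.heapreplace(heap, (key(following), index, following))


def feed_strategy(
    strategy: Any,
    symbol_name: str,
    streams: dict[str, Iterable[Any]],
    key: Callable[[Any], Any],
) -> None:
    """Dispatch each merged update to the matching Strategy method."""
    handlers = {
        "tick": strategy.process_tick_update,
        "trade": strategy.process_trade_update,
        "book": strategy.process_book_snapshot_update,
    }
    for name, update in merge_by_receipt_time(streams, key):
        handlers[name](symbol_name, update)
```

The merge assumes each individual stream is already sorted by the chosen key. Updates with equal keys are emitted in the order the streams were passed in.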

What we're looking for

Efficient file reading and parsing of large parquet files
A clean implementation focusing on the core market data processing functionality
Good memory management and data streaming practices
Readable and well-structured code that separates concerns between data ingestion and processing
Thoughtful handling of data ordering, sequencing and timestamps

What we're not looking for

Don't use libraries such as pandas or numpy
Don't load more than a single update in memory per dataType at a time.
This is a focused problem. If your solution is overly complex, take a step back.

Suggested architecture

Note: While this represents a suggested structure as a starting point, you are encouraged to adapt and modify the implementation according to your needs and design choices.

from pathlib import Path
from typing import Any, Iterable


class DataReader:
    def __init__(self, file_paths: Iterable[Path]) -> None:
        ...

    def load_next_file(self) -> None:
        """Loads the following file into the FileDriver."""
        ...

    def get_next_update(self) -> Any:
        """Returns the next raw update from the file. If we get a StopIteration,
        we reached the end of the file and should close it, and if there's another one, load it."""
        ...

    def parse(self) -> None:
        pass

    def update(self) -> None:
        pass


class TickDataReader(DataReader):
    def __init__(self, file_paths: Iterable[Path]) -> None:
        ...

    def parse(self) -> dict:
        ...

    def update(self, update: dict, strategy: "Strategy") -> None:
        # Method name matches the Strategy interface listed under System Behavior.
        strategy.process_tick_update(self.symbol_name, update)

(...)

class Driver:
    def __init__(self, market_data: list[DataReader], strategy: "Strategy") -> None:
        ...

    def remove_file(self, flat_file: DataReader) -> None:
        ...

    def start(self, start_timestamp: int) -> None:
        ...

    def run(self) -> None:
        ...

Extension Options

If time allows, consider implementing one of these extensions:

Implement a configuration that allows our strategy to use the exchange timestamp instead of the receipt timestamp, with a configurable fixed delay, to test the importance of our roundtrip latency in production.
Suggest an implementation that increases or decreases the roundtrip latency to account for "exchange clogging": it should add latency when the timedelta between updates is low, and subtract latency when it is high. An exchange should be able to process ~5000 updates per second, and start experiencing some (capped) throttling after that.
Add support for book_updates files and rate_update files, available on the same location.
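
The two latency extensions could be sketched as below. Everything here is an assumption made for illustration: the names `LatencyConfig`, `clogging_adjustment`, and `effective_timestamp` are hypothetical, the linear and symmetric shape of the adjustment is only one possible model, and all values are expressed in whatever unit the timestamps use. At ~5000 updates per second the capacity gap is 1/5000 of a second, for example 200 if timestamps are in microseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyConfig:
    use_exchange_timestamp: bool = False
    fixed_delay: int = 0  # added to ts, in timestamp units
    capacity_gap: int = 200  # gap between updates at ~5000 updates/s (assumed microseconds)
    max_adjustment: int = 0  # cap on the clogging adjustment; 0 disables it


def clogging_adjustment(gap: int, capacity_gap: int, max_adjustment: int) -> int:
    """Extra latency: positive when updates are dense, negative when sparse."""
    # Assumption: a linear model, clamped symmetrically to +/- max_adjustment.
    raw = capacity_gap - gap
    return max(-max_adjustment, min(max_adjustment, raw))


def effective_timestamp(ts: int, rts: int, gap: int, config: LatencyConfig) -> int:
    """Timestamp used to order an update when fed to the strategy."""
    if not config.use_exchange_timestamp:
        return rts
    adjustment = clogging_adjustment(gap, config.capacity_gap, config.max_adjustment)
    return ts + config.fixed_delay + adjustment
```

Here `gap` is the time since the previous update on the same stream, which the caller would need to track. The sketch does not guard against a negative adjustment larger than the fixed delay, so a real implementation may want to clamp the result to be no earlier than ts.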

Instructions

Aim to spend about 1 hour on the implementation.

The code should be comparable to code you'd put in front of others for code review and put in production. It should address production concerns, but the number of concerns it addresses may be limited given the time constraint. Include what you can. If you're short on time, aim to make something unpolished that works rather than something polished but incomplete.

Include a README that explains:

Your assumptions and the reasoning behind them
Design decisions, especially for ambiguous aspects of the challenge
How to run your solution
What you would improve with more time
Any challenges you faced

When you are done, make a PR to the repo with your proposed solution. Include your name on the PR.

Use the tools and language you are most proficient with to complete the solution.
