Design and implement a file driver that manages the sequential processing of market data. The system should intelligently find, iterate and process market updates in sequential order based on their arrival time.
Especial consideration should be placed on:
- The memory usage of the program. The files are too large to be loaded in memory at once, so a low-memory solution is required.
- The data comes as a
.parquetfile, so the processing of these files should take into account the decoding, of the file, as well as the transformation into the desired type. The desired type is left as an exercise to the implementer.
There will be 3 different types of market data to process.
This update is generated when there's any update in price. The update reflects the state of the top of the book the moment of the change in price. The relevant data to process for this assignment is:
| Name | ColumnName | Description |
|---|---|---|
| ts | exchangeTimestamp | timestamp of the market update (At the exchange) |
| rts | receiptTimestamp | timestamp of arrival of the market update (At your server) |
| bid_price | bid | bid price of the market update |
| ask_price | ask | ask price of the market update |
| bid_size | bidVolume | bid size of the market update |
| ask_size | askVolume | ask size of the market update |
This update is generated when there's a trade executed. The update contains information about the executed trade including price and quantity. The relevant data to process for this assignment is:
| Name | ColumnName | Description |
|---|---|---|
| ts | exchangeTimestamp | timestamp of the trade execution (At the exchange) |
| rts | receiptTimestamp | timestamp of arrival of the trade update (At your server) |
| price | price | executed trade price |
| size | volume | executed trade quantity |
| side | isBuySide | side of the aggressor (buy/sell) |
Note: isBuySide should be transformed to 1 if the agressor is buy, -1 otherwise.
This update provides a snapshot of the full order book state at a point in time. The update contains all price levels with their respective sizes on both bid and ask sides. The relevant data to extract and process for this assignment is:
| Name | ColumnName | Description |
|---|---|---|
| ts | exchangeTimestamp | timestamp of the order book state (At the exchange). |
| rts | receiptTimestamp | timestamp of arrival of the snapshot (At your server) |
| bids | bids | array of [price, size] pairs for bid side |
| asks | asks | array of [price, size] pairs for ask side |
Note: This update comes as two separate order book updates. You can assume the updates are symmetrical and the ts and rts fields will be the same on both sides of the book.
Your solution should demonstrate how you approach the following scenarios:
- Handling the sequential processing of files, taking into account:
- There are several files per day for each datatype.
- Files will be ordered meaning
2024-07-07.binance.BTCUSDT.00.parquetwill precede2024-07-07.binance.BTCUSDT.01.parquet
- You'll have a class, called
Strategythat with the following methods.process_data()process_tick_update(symbol_name, update)process_trade_update(symbol_name, update)process_book_snapshot_update(symbol_name, update)
- The updates should: be read -> parsed and transformed -> fed sequentially to the strategy object.
- Efficient file reading and parsing of large parquet files
- A clean implementation focusing on the core market data processing functionality
- Good memory management and data streaming practices
- Readable and well-structured code that separates concerns between data ingestion and processing
- Thoughtful handling of data ordering, sequencing and timestamps
- Don't use libraries such as
pandas or numpy - Don't load more than a single update in memory per dataType at a time.
- This is a focused problem. If your solution is overly complex, take a step back
Note: While this represents a suggested structure as a starting point, you are encouraged to adapt and modify the implementation according to your needs and design choices.
class DataReader:
def __init__(self, file_paths: Iterable[Path]) -> None:
def load_next_file(self) -> None:
"""loads the following file into the FileDriver"""
def get_next_update(self) -> None:
"""Returns the next raw update from the file. If we get a StopIteration,
we reached the end of the file and should close it, and if there's another one, load it"""
def parse(self) -> None:
pass
def update(self) -> None:
pass
class TickDataReader(DataReader):
def __init__(self, file_paths: Iterable[Path]) -> None:
def parse(self) -> dict:
def update(self, update: dict, strategy: Strategy) -> None:
strategy.tick_update(self.symbol_name, update)
(...)class Driver:
def __init__(self, market_data: list[DataReader], strategy: Strategy) -> None:
def remove_file(self, flat_file: DataReader) -> None:
def start(self, start_timestamp: int) -> None:
def run(self) -> None:If time allows, consider implementing one of these extensions:
- Implement a configuration that allows our strategy to use
timestampinstead ofreceipt timestamp, with configurable fixed delay to test the importance of our roundtrip latency in production. - Suggest a implementation to increase/decrease the roundtrip latency accounting for "exchange clogging", which should add latency if the
timedeltabetween updates is low, and subtract latency if high. An exchange should be able to process ~5000 updates per second, and start experiencing some (capped) throttling after that. - Add support for
book_updatesfiles andrate_updatefiles, available on the same location.
Aim to spend about 1 hour on the implementation.
The code should be comparable to code you'd put in front of others for code review and put in production. It should address production concerns, but the number of concerns it addresses may be limited given the time constraint. Include what you can. If you're short on time, aim to make something unpolished that works rather than something polished but incomplete.
Include a README that explains:
- Your assumptions and the reasoning behind them
- Design decisions, especially for ambiguous aspects of the challenge
- How to run your solution
- What you would improve with more time
- Any challenges you faced
When you are done, make a PR to the repo with your proposed solution. Include your name on the PR.
Use the tools and language you are most proficient with to complete the solution.