Time-boxed: 4–6 hours.
This README explains the interview brief and its objectives. My solution report is in solution_report.md.
Use a language model (≤ 1B parameters) to solve 4-digit by 4-digit multiplication:
1222x3399=4153578
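For reference, the example above can be checked with schoolbook long multiplication, which splits the product into one shifted partial product per digit of the second operand. This is only a minimal sketch of the arithmetic; the helper name `partial_products` is hypothetical and is not part of the provided package.

```python
def partial_products(a: int, b: int) -> list[int]:
    """Schoolbook multiplication: one shifted partial product per digit of b."""
    parts = []
    # Walk the digits of b from least to most significant.
    for place, digit in enumerate(reversed(str(b))):
        parts.append(a * int(digit) * 10 ** place)
    return parts


parts = partial_products(1222, 3399)
print(parts)       # [10998, 109980, 366600, 3666000]
print(sum(parts))  # 4153578
```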
How you get there is up to you - train from scratch, fine-tune a pretrained checkpoint, prompt a frozen one, or anything in between.
A parser plus a calculator solves multiplication perfectly in one line. We know. This is a deliberately simple-looking - but not actually simple - problem, chosen as a controlled probe of how well you understand language modeling and its limits.
- Package quality - readability, structure, reproducibility.
- Depth of understanding - can you walk us through every design choice and its trade-offs without hand-waving?
- Honest analysis of what your model actually does and where it fails.
- Metric performance - last.
- `generate_dataset.py` - produces `train.jsonl` / `test.jsonl` of multiplication examples (real, working).
- `run.py` - runnable inference + evaluation pipeline; supports `--data` and `--prompt`.
- `math_nlp/` - package skeleton with three replaceable hooks:
  - `transform.py` - `Transform` placeholder (identity), an optional text-level transformation between the canonical example and the form your model consumes / produces.
  - `tokenizer.py` - `Tokenizer` placeholder (identity).
  - `model.py` - `Model` placeholder (random integer).

`data.py` and `evaluate.py` are real, working pieces (sampling, formatting, JSONL I/O, accuracy).
The placeholders are deliberate. The pipeline runs end-to-end with them so you only need to swap in real components.
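To make the shape of the three hooks concrete, here is a rough sketch of how identity and random placeholders could chain together. The method names (`encode`, `decode`, `generate`), the `predict` helper, and the random range are all assumptions made for illustration; the actual signatures in `math_nlp/` may differ.

```python
import random


class Transform:
    """Identity placeholder: text passes through unchanged."""

    def encode(self, text: str) -> str:
        return text

    def decode(self, text: str) -> str:
        return text


class Tokenizer:
    """Identity placeholder: the 'tokens' are just the input string."""

    def encode(self, text: str) -> str:
        return text

    def decode(self, tokens: str) -> str:
        return tokens


class Model:
    """Placeholder that ignores the prompt and returns a random integer."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)

    def generate(self, tokens: str) -> str:
        # Assumed range: any product of two 4-digit operands.
        return str(self._rng.randint(1000 * 1000, 9999 * 9999))


def predict(prompt: str, transform: Transform, tokenizer: Tokenizer, model: Model) -> str:
    """Chain the hooks: transform, tokenize, generate, detokenize, inverse transform."""
    tokens = tokenizer.encode(transform.encode(prompt))
    return transform.decode(tokenizer.decode(model.generate(tokens)))


print(predict("1222x3399=", Transform(), Tokenizer(), Model(seed=0)))
```

Swapping in a real component then means replacing one class while keeping the chain in `predict` unchanged.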
There is no single right answer along any of these axes. The interesting work is choosing positions and explaining the trade-offs.
- Modeling. From-scratch decoder, fine-tuned pretrained checkpoint, pure prompting / in-context learning of a frozen model, …?
- Tokenization. Off-the-shelf tokenizer, hand-rolled vocabulary, what level of granularity?
- Data transformation. Does the canonical text format make the task easier or harder for the model than it has to be?
Whatever you pick along each axis, we will ask you why.
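As one illustration of how the tokenization and data-transformation axes interact, the sketch below pairs a character-level vocabulary with a transform that reverses the product, so the least significant digit is produced first. This is just one possible position, not a recommended answer; the names `VOCAB`, `to_model_format`, `from_model_format`, `encode`, and `decode` are hypothetical.

```python
VOCAB = list("0123456789x=")  # hypothetical 12-symbol vocabulary
STOI = {ch: i for i, ch in enumerate(VOCAB)}


def to_model_format(prompt: str, completion: str) -> tuple[str, str]:
    """Reverse the product so the least significant digit comes first."""
    return prompt, completion[::-1]


def from_model_format(completion: str) -> str:
    """Undo the reversal to recover the canonical product string."""
    return completion[::-1]


def encode(text: str) -> list[int]:
    """Character-level tokenization: one token per digit or operator symbol."""
    return [STOI[ch] for ch in text]


def decode(ids: list[int]) -> str:
    """Map token ids back to the original characters."""
    return "".join(VOCAB[i] for i in ids)


prompt, target = to_model_format("1222x3399=", "4153578")
print(target)          # 8753514
print(encode(prompt))  # [1, 2, 2, 2, 10, 3, 3, 9, 9, 11]
```

Whether reversing helps, and whether one token per digit is the right granularity, are exactly the kinds of choices that need a justification.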
- How do humans do multiplication?
- Generate data; decide the size and distribution; justify.
- Pick and implement a modeling strategy.
- Pick and implement a tokenizer.
- Decide whether (and how) to transform the data.
- Wire everything into `run.py` so that `python run.py --data data/test.jsonl` reports accuracy (is that the right metric?) and `python run.py --prompt AxB` returns a prediction.
- Write a short notebook: your decisions and why, what you observed, where the model fails, what you would do with another day.
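On the question of whether accuracy is the right metric, the sketch below contrasts exact-match accuracy with a per-digit accuracy that gives partial credit. Both functions are hypothetical illustrations and are not taken from `evaluate.py`; the alignment from the least significant digit is an assumption.

```python
def exact_match(preds: list[str], golds: list[str]) -> float:
    """Fraction of predictions that equal the gold product exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def digit_accuracy(preds: list[str], golds: list[str]) -> float:
    """Fraction of gold digits predicted correctly, aligned from the last digit.

    Extra leading digits in a prediction are not penalized here, which is
    one reason this metric should be read alongside exact match.
    """
    correct = total = 0
    for p, g in zip(preds, golds):
        p_rev, g_rev = p[::-1], g[::-1]
        total += len(g_rev)
        correct += sum(i < len(p_rev) and p_rev[i] == ch for i, ch in enumerate(g_rev))
    return correct / total


preds = ["4153578", "4153579"]
golds = ["4153578", "4153578"]
print(exact_match(preds, golds))     # 0.5
print(digit_accuracy(preds, golds))  # 13 of 14 digits correct
```

A prediction that is off by one digit counts as a full miss under exact match but as a near hit per digit, so the two numbers together say more about where the model fails than either alone.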
python generate_dataset.py --n 10000 --test-frac 0.1 --seed 0
python run.py --data data/test.jsonl
python run.py --prompt 3344x1119
With the placeholders in place, evaluation accuracy is ~0 and --prompt
returns a random integer. That is expected.
JSON Lines, one example per line:
{"prompt": "1222x3399=", "completion": "4153578"}
Operands are sampled uniformly in [1000, 9999]; the product is a
natural number. Train and test are guaranteed disjoint
(sampling is without replacement across the full dataset).
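A minimal sketch of how such a dataset could be produced is shown below, assuming disjointness is enforced by sampling distinct ordered operand pairs before splitting. The function name `sample_examples` and the split logic are assumptions, not the actual contents of `generate_dataset.py`.

```python
import json
import random


def sample_examples(n: int, test_frac: float, seed: int) -> tuple[list[dict], list[dict]]:
    """Sample n distinct (a, b) pairs with 4-digit operands, then split train/test."""
    rng = random.Random(seed)
    # There are 9000 * 9000 ordered pairs; sampling indices without
    # replacement guarantees that no pair appears twice.
    indices = rng.sample(range(9000 * 9000), n)
    examples = []
    for idx in indices:
        a, b = 1000 + idx // 9000, 1000 + idx % 9000
        examples.append({"prompt": f"{a}x{b}=", "completion": str(a * b)})
    n_test = int(n * test_frac)
    return examples[n_test:], examples[:n_test]


train, test = sample_examples(n=10000, test_frac=0.1, seed=0)
print(json.dumps(test[0]))  # one JSON Lines record
```

Note that in this sketch `AxB` and `BxA` count as different examples, so a commuted pair could land on both sides of the split; whether the provided generator treats them as duplicates is not stated here.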
- Autoregressive language model, ≤ 1B parameters. Tokenization and text format are yours to choose.
If anything in the brief is ambiguous, ask the interviewer.