Technical test - NLP multiplication

Time-boxed: 4–6 hours.

This readme is for explaining the interview request and objectives. Please find my solution report in solution_report.md.

The task

Use a language model (≤ 1B parameters) to solve 4-digit by 4-digit multiplication:

1222x3399=4153578

How you get there is up to you - train from scratch, fine-tune a pretrained checkpoint, prompt a frozen one, or anything in between.

Why this task

A parser plus a calculator solves multiplication perfectly in one line. We know. This is a deliberately simple-looking - but not actually simple - problem chosen as a controlled probe of how well you understand language-modeling and its limits.

What we grade (in order)

Package quality - readability, structure, reproducibility.
Depth of understanding - can you walk us through every design choice and its trade-offs without hand-waving?
Honest analysis of what your model actually does and where it fails.
Metric performance - last.

What's provided

generate_dataset.py - produces train.jsonl / test.jsonl of multiplication examples (real, working).
run.py - runnable inference + evaluation pipeline; supports --data and --prompt.
math_nlp/ - package skeleton with three replaceable hooks:
- transform.py - Transform placeholder (identity), an optional text-level transformation between the canonical example and the form your model consumes / produces.
- tokenizer.py - Tokenizer placeholder (identity).
- model.py - Model placeholder (random integer).
data.py and evaluate.py are real, working pieces (sampling, formatting, JSONL I/O, accuracy).

The placeholders are deliberate. The pipeline runs end-to-end with them so you only need to swap in real components.

Design space (you decide)

There is no single right answer along any of these axes. The interesting work is choosing positions and explaining the trade-offs.

Modeling. From-scratch decoder, fine-tuned pretrained checkpoint, pure prompting / in-context learning of a frozen model, …?
Tokenization. Off-the-shelf tokenizer, hand-rolled vocabulary, what level of granularity?
Data transformation. Does the canonical text format make the task easier or harder for the model than it has to be?

Whatever you pick along each axis, we will ask you why.

What you do

How do humans do multiplication?
Generate data; decide the size and distribution; justify.
Pick and implement a modeling strategy.
Pick and implement a tokenizer.
Decide whether (and how) to transform the data.
Wire everything into run.py so that python run.py --data data/test.jsonl reports accuracy (is that the right metric?) and python run.py --prompt AxB returns a prediction.
Write a short notebook: your decisions and why, what you observed, where the model fails, what you would do with another day.

Quickstart

python generate_dataset.py --n 10000 --test-frac 0.1 --seed 0
python run.py --data data/test.jsonl
python run.py --prompt 3344x1119

With the placeholders in place, evaluation accuracy is ~0 and --prompt returns a random integer. That is expected.

Data format

JSON Lines, one example per line:

{"prompt": "1222x3399=", "completion": "4153578"}

Operands are sampled uniformly in [1000, 9999]; the product is a natural number. Train and test are guaranteed disjoint (sampling is without replacement across the full dataset).

Constraints

Autoregressive language model, ≤ 1B parameters. Tokenization and text format are yours to choose.

If anything in the brief is ambiguous, ask the interviewer.

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
checkpoints/qwen_lora_atomic_n5000_e1_drop0.1		checkpoints/qwen_lora_atomic_n5000_e1_drop0.1
data		data
math_nlp		math_nlp
results		results
tests		tests
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
generate_dataset.py		generate_dataset.py
pyproject.toml		pyproject.toml
run.py		run.py
solution_report.md		solution_report.md
train.py		train.py
train_pretrained.py		train_pretrained.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Technical test - NLP multiplication

The task

Why this task

What we grade (in order)

What's provided

Design space (you decide)

What you do

Quickstart

Data format

Constraints

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Technical test - NLP multiplication

The task

Why this task

What we grade (in order)

What's provided

Design space (you decide)

What you do

Quickstart

Data format

Constraints

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages