mxz2013/NLP_MATH

Technical test - NLP multiplication

Time-boxed: 4–6 hours.

This README explains the interview brief and its objectives. My solution report is in solution_report.md.

The task

Use a language model (≤ 1B parameters) to solve 4-digit by 4-digit multiplication:

1222x3399=4153578

How you get there is up to you - train from scratch, fine-tune a pretrained checkpoint, prompt a frozen one, or anything in between.

Why this task

A parser plus a calculator solves multiplication perfectly in one line. We know. The problem looks simple but is not, and we chose it as a controlled probe of how well you understand language modeling and its limits.

What we grade (in order)

  1. Package quality - readability, structure, reproducibility.
  2. Depth of understanding - can you walk us through every design choice and its trade-offs without hand-waving?
  3. Honest analysis of what your model actually does and where it fails.
  4. Metric performance - last.

What's provided

  • generate_dataset.py - produces train.jsonl / test.jsonl of multiplication examples (real, working).
  • run.py - runnable inference + evaluation pipeline; supports --data and --prompt.
  • math_nlp/ - package skeleton with three replaceable hooks:
    • transform.py - Transform placeholder (identity), an optional text-level transformation between the canonical example and the form your model consumes / produces.
    • tokenizer.py - Tokenizer placeholder (identity).
    • model.py - Model placeholder (random integer).
  • data.py and evaluate.py are real, working pieces (sampling, formatting, JSONL I/O, accuracy).

The placeholders are deliberate. The pipeline runs end-to-end with them so you only need to swap in real components.
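
To make the idea of replaceable hooks concrete, here is a minimal sketch of how an identity transform and a random-integer model might compose into one prediction call. The class names, method names (`apply`, `invert`, `generate`), and the sampling range are assumptions made for illustration, not the actual interfaces in math_nlp/; the tokenizer hook is left out for brevity.

```python
import random


class IdentityTransform:
    """Placeholder-style transform: text passes through unchanged."""

    def apply(self, text: str) -> str:
        return text

    def invert(self, text: str) -> str:
        return text


class RandomModel:
    """Placeholder-style model: ignores the prompt and guesses an integer."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)

    def generate(self, prompt: str) -> str:
        # Assumed range: products of two 4-digit numbers lie in
        # [1_000_000, 99_980_001].
        return str(self.rng.randint(1_000_000, 99_980_001))


def predict(prompt: str, transform: IdentityTransform, model: RandomModel) -> str:
    """Canonical prompt -> model-facing text -> raw output -> canonical answer."""
    raw = model.generate(transform.apply(prompt))
    return transform.invert(raw)


answer = predict("1222x3399=", IdentityTransform(), RandomModel(seed=0))
```

Swapping in a real component then means replacing one class while `predict` stays the same.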

Design space (you decide)

There is no single right answer along any of these axes. The interesting work is choosing positions and explaining the trade-offs.

  • Modeling. From-scratch decoder, fine-tuned pretrained checkpoint, pure prompting / in-context learning of a frozen model, …?
  • Tokenization. Off-the-shelf tokenizer, hand-rolled vocabulary, what level of granularity?
  • Data transformation. Does the canonical text format make the task easier or harder for the model than it has to be?

Whatever you pick along each axis, we will ask you why.
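
As one illustration of a position in this design space, the sketch below pairs a character-level vocabulary with a transform that writes the product least-significant digit first. Both are assumptions chosen for the example, not a recommendation from the brief: one token per digit avoids multi-digit merges, and a reversed target may line up generation order with the direction in which carries propagate.

```python
# Hypothetical character-level vocabulary: ten digits plus the two symbols
# that appear in the canonical format ("x" and "=").
VOCAB = list("0123456789x=")
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode(text: str) -> list[int]:
    """Map each character to its integer id (one token per character)."""
    return [TOKEN_TO_ID[ch] for ch in text]


def decode(ids: list[int]) -> str:
    """Inverse of encode."""
    return "".join(VOCAB[i] for i in ids)


def reverse_completion(completion: str) -> str:
    """Illustrative transform: emit the product least-significant digit first."""
    return completion[::-1]


ids = encode("1222x3399=")
roundtrip = decode(ids)
reversed_target = reverse_completion("4153578")
```

Applying `reverse_completion` a second time restores the canonical answer, so the transform is its own inverse.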

What you do

  1. Ask yourself how humans do multiplication.
  2. Generate data; decide the size and distribution; justify.
  3. Pick and implement a modeling strategy.
  4. Pick and implement a tokenizer.
  5. Decide whether (and how) to transform the data.
  6. Wire everything into run.py so that python run.py --data data/test.jsonl reports accuracy (is that the right metric?) and python run.py --prompt AxB returns a prediction.
  7. Write a short notebook: your decisions and why, what you observed, where the model fails, what you would do with another day.
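
Step 1 can be made concrete with schoolbook long multiplication: multiply the first operand by each digit of the second, shift by the digit's place value, and sum. The sketch below computes those partial products and formats them as an intermediate "scratchpad" line. The function names and the scratchpad text format are hypothetical; whether showing intermediate steps helps a given model is something to test, not assume.

```python
def partial_products(a: int, b: int) -> list[int]:
    """Return the shifted partial products of schoolbook long multiplication.

    Entry i is a * (i-th digit of b, counting from the right) * 10**i.
    """
    products = []
    for i, digit in enumerate(reversed(str(b))):
        products.append(a * int(digit) * 10**i)
    return products


def scratchpad(a: int, b: int) -> str:
    """Format partial products and their sum as one line (hypothetical format)."""
    parts = partial_products(a, b)
    return f"{a}x{b}=" + "+".join(str(p) for p in parts) + f"={sum(parts)}"


parts = partial_products(1222, 3399)
line = scratchpad(1222, 3399)
# parts == [10998, 109980, 366600, 3666000]
# line  == "1222x3399=10998+109980+366600+3666000=4153578"
```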

Quickstart

python generate_dataset.py --n 10000 --test-frac 0.1 --seed 0
python run.py --data data/test.jsonl
python run.py --prompt 3344x1119

With the placeholders in place, evaluation accuracy is ~0 and --prompt returns a random integer. That is expected.
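
Exact-match accuracy treats an answer that is off by one digit the same as a random guess. A rough sketch of a complementary per-digit metric follows; the function names and the right-alignment convention are assumptions for illustration, not what evaluate.py implements.

```python
def exact_match(predictions: list[str], targets: list[str]) -> float:
    """Fraction of examples where the whole predicted string equals the target."""
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)


def digit_accuracy(predictions: list[str], targets: list[str]) -> float:
    """Fraction of target digits predicted correctly, aligned from the right.

    Missing digits count as errors; extra predicted digits are ignored here
    (one possible convention among several).
    """
    correct = total = 0
    for p, t in zip(predictions, targets):
        p_rev, t_rev = p[::-1], t[::-1]
        for i, ch in enumerate(t_rev):
            correct += i < len(p_rev) and p_rev[i] == ch
            total += 1
    return correct / total


preds = ["4153578", "4153579", "9153578"]
golds = ["4153578", "4153578", "4153578"]
em = exact_match(preds, golds)     # 1 of 3 answers fully correct
da = digit_accuracy(preds, golds)  # 19 of 21 digits correct
```

Reporting both separates a model that is nearly right from one that is guessing, which exact match alone cannot do.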

Data format

JSON Lines, one example per line:

{"prompt": "1222x3399=", "completion": "4153578"}

Operands are sampled uniformly in [1000, 9999], so each product is a natural number between 1,000,000 and 99,980,001 (7 or 8 digits). Train and test are guaranteed disjoint (sampling is without replacement across the full dataset).
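
One way such a guarantee can be obtained is sketched here: draw distinct indices from the space of 9000 × 9000 ordered operand pairs, then split the result. This illustrates the idea only and is not the code in generate_dataset.py; `sample_examples` and `split` are hypothetical names.

```python
import json
import random


def sample_examples(n: int, seed: int = 0) -> list[dict]:
    """Draw n distinct ordered operand pairs from [1000, 9999] x [1000, 9999]."""
    rng = random.Random(seed)
    # 9000 * 9000 ordered pairs; sampling indices without replacement
    # guarantees no pair appears twice.
    indices = rng.sample(range(9000 * 9000), n)
    examples = []
    for idx in indices:
        a, b = 1000 + idx // 9000, 1000 + idx % 9000
        examples.append({"prompt": f"{a}x{b}=", "completion": str(a * b)})
    return examples


def split(examples: list[dict], test_frac: float) -> tuple[list[dict], list[dict]]:
    """Split already-distinct examples into train and test parts."""
    n_test = int(len(examples) * test_frac)
    return examples[n_test:], examples[:n_test]


examples = sample_examples(1000, seed=0)
train, test = split(examples, test_frac=0.1)
first_line = json.dumps(examples[0])  # one JSON Lines record
```

Note that this sketch treats 1222x3399 and 3399x1222 as different examples, so a commuted pair can land on both sides of the split. Whether that counts as leakage is worth deciding explicitly.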

Constraints

  • Autoregressive language model, ≤ 1B parameters. Tokenization and text format are yours to choose.
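
For a from-scratch model, the parameter budget can be sanity-checked with a common back-of-the-envelope estimate. The sketch below assumes a standard decoder-only transformer with feed-forward width 4 × d_model and tied embeddings, and ignores biases, layer norms, and positional embeddings; the function name and the example sizes are hypothetical.

```python
def approx_decoder_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Rough parameter count for a decoder-only transformer.

    Assumes a feed-forward width of 4 * d_model and tied input/output
    embeddings; ignores biases, layer norms, and positional embeddings.
    """
    per_layer = 12 * d_model**2  # 4*d^2 attention + 8*d^2 feed-forward
    return n_layers * per_layer + vocab_size * d_model


LIMIT = 1_000_000_000
small = approx_decoder_params(n_layers=6, d_model=256, vocab_size=12)
within_budget = small <= LIMIT  # about 4.7M parameters, far below the limit
```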

If anything in the brief is ambiguous, ask the interviewer.
