C3 · Interview Pack

Transformers and Tokenization Interview Questions

This file prepares deep technical interviews on transformer internals, tokenization behavior, and production tradeoffs.

Advanced · 30 questions

Scope

This file prepares deep technical interviews on transformer internals, tokenization behavior, and production tradeoffs.

How To Use This File

Start with the Foundation cluster.
For top questions, practice all four answer layers:
1. short answer
2. deep answer
3. follow-up ladder
4. anti-pattern answer to avoid
Use the linked explainers to refresh concepts before mock interviews.

Interviewer Probe Map

Can you derive attention behavior, not just define it?
Can you connect architecture choices to latency, cost, and reliability?
Can you debug transformer/tokenization failures systematically?
Can you defend tradeoffs for different company environments?

flowchart TD
    A[Question Asked] --> B{Concept or Systems?}
    B -- Concept --> C[Give Short Definition]
    C --> D[Add Equation or Mechanism]
    D --> E[Add Production Implication]
    B -- Systems --> F[State Constraints]
    F --> G[Present Tradeoff Decision]
    G --> H[Add Monitoring and Failure Modes]
    E --> I[Handle Follow-Ups]
    H --> I

Figure: Recommended answer flow during transformer interviews.

Question Clusters

Foundation: Q1 to Q10
Systems and Production: Q11 to Q20
Debugging and Optimization: Q21 to Q30

Core Questions

Q1: Walk through scaled dot-product attention

What interviewer is probing:

Correct conceptual and tensor-level understanding from first principles.

Direct answer: Attention computes token-token relevance scores from queries and keys, normalizes with softmax, and uses these weights to aggregate values.

Deep answer:

Input token states X are projected into Q, K, V.
Score matrix is QK^T.
Scores are scaled by 1/sqrt(d_k) to prevent softmax saturation.
Add mask M (causal or padding).
Apply softmax over key dimension.
Multiply normalized weights with V to produce context-aware outputs.
Multi-head repeats this in parallel subspaces then concatenates and projects.
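
The steps above can be sketched for a single head in plain Python. This is a minimal illustration, not a production kernel: it assumes identity projections (Q, K, and V are all set to the input X), and the function names `softmax` and `attention` are chosen here for the example.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention for one head; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    out, weights = [], []
    for i, q in enumerate(Q):
        # Scores against every key, scaled by 1/sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Mask M: position i may not attend to positions j > i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)  # normalize over the key dimension
        weights.append(w)
        # Weighted sum of the value vectors.
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out, weights

# Assumption for illustration: identity projections, so Q = K = V = X.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(X, X, X, causal=True)
```

With the causal mask on, the first position can only attend to itself, which is a quick sanity check to mention when asked where masking bugs show up.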

Follow-up variants:

Level 1: Why is scaling required?
Level 2: Where do masking bugs appear in outputs?
Level 3: How does this affect long-context latency?
Level 4: How do fused kernels change practical performance?

Common mistakes and red flags:

"Attention is when the model focuses on important words" with no formula path, masking, or production implications.
Ignoring the causal mask in decoder contexts.
Giving a formula-only answer with no operational meaning.

Follow-up: Why divide by sqrt(d_k)?

What interviewer is probing:

Numerical stability and optimization intuition.

Direct answer: Scaling normalizes logit magnitude so softmax does not saturate as dimensions grow.

Deep answer: If the components of q and k are roughly independent with zero mean and unit variance, the variance of q·k grows linearly with d_k. Large unscaled logits push softmax toward near one-hot outputs, where gradients are close to zero. Dividing by sqrt(d_k) brings the logit variance back to about 1, which keeps softmax in a trainable regime and improves convergence stability.
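
The variance argument can be written out as a short derivation. It assumes the components q_i and k_i are mutually independent with zero mean and unit variance, which is an idealization rather than a property of any trained model.

```latex
\operatorname{Var}(q \cdot k)
  = \operatorname{Var}\Big(\sum_{i=1}^{d_k} q_i k_i\Big)
  = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i)
  = \sum_{i=1}^{d_k} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2]
  = d_k,
\qquad
\operatorname{Var}\Big(\frac{q \cdot k}{\sqrt{d_k}}\Big) = 1 .
```

So the standard deviation of an unscaled logit is sqrt(d_k), and dividing by exactly that factor restores unit scale regardless of head dimension.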

Follow-up variants:

What if scaling is omitted?
How does mixed precision amplify the problem?

Common mistakes and red flags: "The paper does it" with no statistical intuition.

Sample code or pseudocode (when relevant):

# Interview outline
1) Validate inputs and constraints
2) Apply core strategy
3) Add failure handling and observability hooks

Q2: Why multi-head attention?

What interviewer is probing:

Representation diversity and architecture tradeoff reasoning.

Direct answer: Multiple heads let the model learn different relational patterns in parallel low-dimensional projections.

Deep answer: Single-head attention compresses all interaction patterns into one score map. Multi-head attention decomposes this into parallel subspaces, allowing local, long-range, and structural relationships to co-exist. Practical benefit appears in richer features and improved quality at manageable compute cost.
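
A rough sketch of why the compute cost stays manageable: the model dimension is partitioned across heads, so each head works in a smaller subspace. Real implementations apply learned projections before this split; the helper names `split_heads` and `merge_heads` are hypothetical and show only the reshaping step.

```python
def split_heads(x, num_heads):
    """Split one token vector of size d_model into num_heads chunks of size d_model // num_heads."""
    d_model = len(x)
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [x[h * d_head:(h + 1) * d_head] for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate per-head outputs back into one vector of size d_model."""
    return [value for head in heads for value in head]

token = list(range(8))          # stand-in for a projected token state, d_model = 8
heads = split_heads(token, 2)   # two heads, each of dimension 4
restored = merge_heads(heads)
```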

Follow-up variants:

Why do some heads become redundant?
Would more heads always help?

Common mistakes and red flags: "More heads means more accuracy" without discussing diminishing returns.

Q3: Decoder-only vs encoder-decoder choice

What interviewer is probing:

Architecture tradeoff reasoning.

Direct answer: Decoder-only is simpler and dominant for generative LLM serving; encoder-decoder can still be better for some seq2seq workloads.

Deep answer: Decoder-only models unify prompt and generation in one autoregressive path and align well with next-token pretraining. Encoder-decoder models separate encoding and decoding, useful in translation or tightly structured transformation tasks. In production, decoder-only often wins for ecosystem support and serving simplicity, but architecture should be tied to task shape and latency budget.

Follow-up variants:

Which architecture better reuses encoded context?
How does this impact tool-calling systems?

Common mistakes and red flags: Declaring one architecture universally best.

Q4: Tokenization mismatch bug diagnosis

What interviewer is probing:

Production debugging mindset.

Direct answer: Check tokenizer parity and special-token settings first, then compare train and inference token sequences directly.

Deep answer:

Verify tokenizer version, vocab, merges, and special tokens are identical across training and inference.
Diff tokenized outputs for representative samples.
Validate prompt templates and system-prefix changes.
Confirm truncation and max-length policies.
Re-run eval subset after fixes.
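
The first two checks can be sketched with the standard library alone. The sketch assumes the tokenizer artifacts are available as plain Python objects (a vocab dict, a list of merges, a list of special tokens); `tokenizer_fingerprint` and `first_divergence` are hypothetical helper names, not part of any tokenizer library.

```python
import hashlib
import json

def tokenizer_fingerprint(vocab, merges, special_tokens):
    """Hash the tokenizer artifacts so training and inference can be compared cheaply."""
    payload = json.dumps(
        {"vocab": vocab, "merges": merges, "special_tokens": special_tokens},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def first_divergence(train_ids, infer_ids):
    """Return the first index where two token-id sequences differ, or None if identical."""
    for i, (a, b) in enumerate(zip(train_ids, infer_ids)):
        if a != b:
            return i
    if len(train_ids) != len(infer_ids):
        return min(len(train_ids), len(infer_ids))
    return None

# Toy artifacts: inference has an extra special token that training did not.
train_fp = tokenizer_fingerprint({"a": 0, "b": 1}, [["a", "b"]], ["<s>"])
infer_fp = tokenizer_fingerprint({"a": 0, "b": 1}, [["a", "b"]], ["<s>", "<pad>"])
```

Comparing fingerprints in CI is one way to answer the recurrence follow-up: a mismatch fails the build before any eval runs.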

Follow-up variants:

How would mismatch show in metrics?
How do you prevent recurrence in CI?

Common mistakes and red flags: Assuming mismatch is model drift without checking tokenization artifacts.

Q5: Context window vs RAG decision

What interviewer is probing:

Cost/quality tradeoff judgment.

Direct answer: Use larger context when relevant evidence is bounded and stable; use RAG when knowledge is large, dynamic, or private.

Deep answer: Larger context increases prefill cost and may add noise. RAG isolates relevant evidence and improves grounding for changing corpora. Decision factors: document churn, required citation fidelity, latency SLO, and token budget. In most enterprise workloads, RAG plus controlled context beats brute-force long prompts.

Follow-up variants:

How do you detect over-context degradation?
What metrics prove RAG is helping?

Common mistakes and red flags: "Always use the largest context model available."

Q6: Token budget optimization strategy

What interviewer is probing:

Systems-level optimization and prioritization.

Direct answer: Optimize context composition, caching, and retrieval quality before reducing output quality.

Deep answer: Track per-request token cost and break it into system prompt, retrieved context, and generated output. Remove redundant prompt boilerplate, enforce top-k evidence constraints, use prefix caching, and add routing so simple queries avoid expensive paths. Validate quality with regression evals after each change.
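
A minimal sketch of the per-component breakdown, assuming each request log already carries token counts under the hypothetical keys `system_prompt`, `retrieved_context`, and `output`. The sample numbers are made up for illustration.

```python
def token_breakdown(requests):
    """Sum token counts per component and return each component's share of the total."""
    totals = {"system_prompt": 0, "retrieved_context": 0, "output": 0}
    for request in requests:
        for key in totals:
            totals[key] += request[key]
    grand_total = sum(totals.values())
    shares = {k: (v / grand_total if grand_total else 0.0) for k, v in totals.items()}
    return totals, shares

# Made-up request logs, for illustration only.
sample = [
    {"system_prompt": 400, "retrieved_context": 1500, "output": 100},
    {"system_prompt": 400, "retrieved_context": 2500, "output": 100},
]
totals, shares = token_breakdown(sample)
dominant = max(shares, key=shares.get)
```

Knowing which component dominates tells you where to optimize first; here the retrieved context would be the target, not the output length.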

Follow-up variants:

Which component usually dominates cost?
How do you keep optimization from hurting faithfulness?

Common mistakes and red flags: Blindly lowering max tokens without quality checks.

Q7: Explain causal masking failure impact

What interviewer is probing:

Data leakage awareness and model correctness.

Direct answer: Incorrect causal masking leaks future information during training and creates unrealistic evaluation performance.

Deep answer: If future tokens leak into current-token predictions, loss decreases artificially and the model learns dependencies unavailable at inference time. This mismatch causes generation instability and rapid quality drop in production.
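
A small check for this failure can be sketched as follows. The mask convention (True means "may attend") and the helper names are assumptions made for the example; the buggy mask imitates an off-by-one error.

```python
def causal_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def leaks_future(mask):
    """True if any position is allowed to attend to a later position."""
    return any(
        allowed and j > i
        for i, row in enumerate(mask)
        for j, allowed in enumerate(row)
    )

good = causal_mask(4)
# Off-by-one bug: position i can also see position i + 1.
buggy = [[j <= i + 1 for j in range(4)] for i in range(4)]
```

A unit test of this shape on the mask-building code is cheap insurance against a leak that would otherwise only show up as suspiciously low training loss.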

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q8: Prefill vs decode bottleneck diagnosis

What interviewer is probing:

Serving pipeline understanding.

Direct answer: Prompt-heavy requests stress prefill; long generation stresses decode and cache bandwidth.

Deep answer: Measure latency split per phase. High prompt length with moderate output suggests prefill optimization (prompt compression, caching). Long output with modest prompt points to decode efficiency, scheduler policy, and KV-cache pressure.
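
One way to get the per-phase split from request timings is sketched below. It assumes time to first token (TTFT) is a reasonable proxy for prefill plus queueing, and the timings in the example are invented; `phase_split` is a hypothetical helper.

```python
def phase_split(ttft_s, total_s, output_tokens):
    """Split request latency into prefill time and per-token decode time.

    Assumes time to first token (TTFT) approximates prefill plus queueing.
    """
    decode_s = total_s - ttft_s
    # The first token arrives at TTFT, so decode covers the remaining tokens.
    per_token_s = decode_s / max(output_tokens - 1, 1)
    dominant = "prefill" if ttft_s > decode_s else "decode"
    return {
        "prefill_s": ttft_s,
        "decode_s": decode_s,
        "per_token_s": per_token_s,
        "dominant": dominant,
    }

# Invented timings: a prompt-heavy request and a generation-heavy request.
long_prompt = phase_split(ttft_s=2.0, total_s=2.5, output_tokens=26)
long_output = phase_split(ttft_s=0.2, total_s=8.2, output_tokens=401)
```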

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q9: Explain KV cache in one production paragraph

What interviewer is probing:

Practical inference acceleration reasoning.

Direct answer: KV cache stores prior key/value states so each decode step computes keys and values only for the new token and reuses the cached history, instead of recomputing them for every previous token.

Deep answer: Without cache, decode would repeatedly recompute historical attention states, exploding cost. With cache, each new token computes incremental states and attends over cached history. Tradeoff is memory growth proportional to sequence length, which requires careful cache management and batching policies.
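
The memory tradeoff can be estimated with simple arithmetic. The model shape below is hypothetical and chosen only to make the numbers round; `bytes_per_value=2` assumes 16-bit storage.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Bytes needed to hold keys and values for every layer, head, and cached token."""
    # Factor of 2: one tensor for keys, one for values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical model shape, for illustration only.
per_4k = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)
per_8k = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=8192, batch_size=1)
gib_4k = per_4k / 1024 ** 3
```

Under these assumptions a single 4096-token sequence needs 2 GiB of cache, and doubling the sequence length or the batch size doubles it, which is why cache capacity often caps batch size before compute does.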

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q10: Positional encoding choices

What interviewer is probing:

Long-context behavior reasoning.

Direct answer: Positional strategies inject order information and influence long-context quality and extrapolation behavior.

Deep answer: Absolute learned embeddings are simple but can be less robust beyond trained lengths. Relative and rotary schemes better preserve position relationships under context growth. Choice should reflect target context length, hardware constraints, and model family compatibility.
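
The claim that rotary schemes preserve position relationships can be checked numerically. This is a minimal sketch of rotary position embedding that rotates consecutive pairs of dimensions; it assumes an even vector length and the commonly used base of 10000, and the vectors are arbitrary.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by angles proportional to position."""
    d = len(vec)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
# Same relative offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_shifted = dot(rope(q, 105), rope(k, 102))
```

The two scores agree up to floating-point error, because the rotated dot product depends only on the difference between the two positions.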

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Systems and Production Questions

Q11: Why does latency grow faster than expected at long context?

What interviewer is probing:

Complexity and serving realism.

Strong answer framework:

Check attention complexity effects and prefill burden.
Check batching and queue delay.
Check cache memory pressure and fallback behavior.
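
The first check can be backed by a back-of-envelope estimate. The sketch counts only the two attention matmuls (QK^T and weights times V) at roughly 2 * seq_len^2 * d_model FLOPs each per layer; projections and MLP layers, which scale linearly with sequence length, are left out on purpose. The model shape is hypothetical.

```python
def attention_matmul_flops(seq_len, d_model, num_layers):
    """Approximate FLOPs for the QK^T and weights-times-V matmuls during prefill."""
    # Two matmuls, each about 2 * seq_len^2 * d_model FLOPs per layer.
    return 2 * 2 * seq_len ** 2 * d_model * num_layers

# Hypothetical model shape, for illustration only.
short = attention_matmul_flops(seq_len=4096, d_model=4096, num_layers=32)
long = attention_matmul_flops(seq_len=16384, d_model=4096, num_layers=32)
ratio = long / short
```

A 4x longer prompt costs 16x in this term, which is the "faster than expected" part; queueing and cache pressure then compound it.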

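Direct answer: Self-attention compute grows quadratically with sequence length during prefill, and long contexts also inflate KV-cache memory, which shrinks feasible batch sizes and adds queueing delay. The tradeoff is that mitigations such as truncation or retrieval save latency at some risk to answer quality.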

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q12: How do you benchmark transformer serving safely?

What interviewer is probing:

Experiment design discipline.

Strong answer framework:

Fix workload distributions (prompt/output lengths).
Track p50, p95, p99, throughput, and error rate.
Add quality regression checks for each optimization.
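
The metrics in this framework can be computed from raw samples as sketched below. It uses the nearest-rank percentile definition with integer percentages; the helper names and the synthetic latencies are assumptions made for the example.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling division avoids floating-point rounding in the rank.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]

def summarize(latencies_ms, error_count, duration_s):
    """Summarize one benchmark run; latencies_ms holds successful requests only."""
    completed = len(latencies_ms)
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "throughput_rps": completed / duration_s,
        "error_rate": error_count / (completed + error_count),
    }

# Synthetic latencies 1..100 ms, for illustration only.
summary = summarize(list(range(1, 101)), error_count=0, duration_s=10)
```

Reporting all of these from the same run, on the same fixed workload, keeps an optimization from looking good on p50 while quietly hurting the tail.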

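Direct answer: Replay a fixed, production-like distribution of prompt and output lengths, report p50, p95, p99, throughput, and error rate together, and gate every optimization on a quality regression check. The tradeoff is slower iteration in exchange for results you can trust.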

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q13: Quantization rollout strategy for production

What interviewer is probing:

Quality-risk management under optimization pressure.

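Direct answer: Roll out quantization in stages: evaluate the quantized model offline against the full-precision baseline on task-specific evals, then canary a small traffic slice with quality and latency guardrails before widening. The tradeoff is lower cost and latency against a risk of quality loss that aggregate metrics can hide.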

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q14: How do you detect context truncation regressions?

What interviewer is probing:

Observability and evaluation design.

Direct answer: Log prompt token counts against the context limit, track the fraction of requests that are truncated and which segment was cut, and run evals with inputs near the limit. The tradeoff is extra logging and eval cost in exchange for catching silent quality loss.

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q15: Batch size tuning under strict p95 SLO

What interviewer is probing:

Throughput vs tail latency tradeoffs.

Direct answer: Treat the p95 SLO as the hard constraint and raise batch size only while p95 under realistic load stays within budget. Larger batches improve throughput and cost per token but add queueing and per-step latency.

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q16: Long prompt injection defense in context-heavy systems

What interviewer is probing:

Security controls tied to architecture.

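Direct answer: Treat retrieved and user-supplied context as untrusted data: keep it clearly separated from system instructions, restrict tool permissions, and validate outputs before acting on them. The tradeoff is added latency and some false positives in exchange for reduced injection risk.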

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q17: Which metrics trigger rollback after model swap?

What interviewer is probing:

Production ownership and incident response.

Direct answer: Define rollback thresholds before the swap on task quality (eval or user-feedback regressions), p95 latency, error and malformed-output rates, and cost per request. Tight thresholds protect reliability at the cost of occasional false rollbacks.

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q18: When should you use smaller specialist model routing?

What interviewer is probing:

Cost/performance optimization judgment.

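Direct answer: Route to a smaller specialist model when a traffic segment is narrow, high-volume, and measurably served as well as by the larger model, with fallback to the larger model on low confidence. The tradeoff is lower cost and latency against routing errors and extra operational complexity.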

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q19: How do you isolate model regression from retrieval regression?

What interviewer is probing:

Layered debugging methodology.

Direct answer: Hold one layer fixed while varying the other: replay the same retrieved context through the old and new models, and score retrieval separately with retrieval metrics on a labeled set. The tradeoff is the cost of maintaining logged contexts and labeled data.

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q20: How do you choose model families for multilingual and code-heavy traffic?

What interviewer is probing:

Tokenization-aware product design.

Direct answer: Compare candidate tokenizers on your real traffic by measuring tokens per character for each language and for code, since poor tokenizer coverage inflates cost and latency and consumes context, then confirm task quality per segment. The tradeoff is often between one broad model and routing across specialists.

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Debugging and Optimization Questions

Q21: Repetition loops during generation

What interviewer is probing:

Decoding control and stability debugging.

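Direct answer: Check decoding settings first (temperature, top-p, repetition or frequency penalties, max tokens, stop sequences), then check for prompt-template or truncation problems. The tradeoff is that stronger repetition penalties reduce loops but can degrade outputs that need legitimate repetition.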

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q22: Sudden token cost spike week-over-week

What interviewer is probing:

Monitoring and change attribution.

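Direct answer: Break token usage down by component (system prompt, retrieved context, output) and by route or customer, then correlate the jump with recent changes to prompts, retrieval settings, tokenizer, or traffic mix. The tradeoff is investing in per-request attribution logging before it is needed.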

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q23: High p95 with healthy p50

What interviewer is probing:

Tail latency bottleneck reasoning.

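Direct answer: A healthy p50 with a high p95 points to a tail-specific cause, such as a minority of very long prompts or outputs, queueing under bursty load, or cache pressure, so segment latency by request shape and phase before tuning. Fixes such as admission limits or smaller batches trade throughput for tail latency.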

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q24: Model performs well offline but fails in production

What interviewer is probing:

Distribution shift and integration gaps.

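Direct answer: Look for gaps between the offline eval and production: input distribution shift, different prompt templates or tokenizer settings, truncation, and integration issues around retrieval or tools. The tradeoff is the cost of building evals from sampled production traffic.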

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q25: Inconsistent outputs across seemingly same prompts

What interviewer is probing:

Hidden variable detection (temperature, templates, truncation).

Direct answer: Diff the fully rendered requests rather than the visible prompts, since sampling settings such as temperature, template versions, and truncation often differ silently. Pinning these improves reproducibility at the cost of output diversity.

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Q26: Tool-call JSON malformed intermittently

What interviewer is probing:

Structured output constraints and parsing resilience.

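Direct answer: Validate every tool call against its schema, prefer constrained or structured decoding where the serving stack supports it, and retry or repair on parse failure with logging. The tradeoff is added latency and complexity in exchange for parsing reliability.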

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Sample code or pseudocode (when relevant):

# Interview outline
1) Validate inputs and constraints
2) Apply core strategy
3) Add failure handling and observability hooks
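
The parsing-resilience half of the answer can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the two-field schema (`name`, `arguments`), `parse_tool_call`, and `call_with_retry` are hypothetical, and `generate` stands in for whatever function calls the model.

```python
import json

# Hypothetical minimal schema: field name -> required Python type.
REQUIRED_FIELDS = {"name": str, "arguments": dict}


def parse_tool_call(raw):
    """Return (call, error); exactly one of the two is None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return None, "top-level value must be an object"
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(call.get(field), expected):
            return None, f"field {field!r} missing or not {expected.__name__}"
    return call, None


def call_with_retry(generate, max_attempts=3):
    """generate(feedback) returns raw model text; feedback is the last error or None."""
    error = None
    for attempt in range(1, max_attempts + 1):
        call, error = parse_tool_call(generate(error))
        if call is not None:
            return call, attempt
    raise ValueError(f"tool call still malformed after {max_attempts} attempts: {error}")


# Simulated model: the first output is truncated, the second is valid.
outputs = iter([
    '{"name": "search", "arguments": {"q": "kv cache"',
    '{"name": "search", "arguments": {"q": "kv cache"}}',
])
call, attempts = call_with_retry(lambda feedback: next(outputs))
print(call["name"], attempts)  # search 2
```

Counting attempts per request gives a direct metric for the rollback question: a rising retry rate after a model or prompt change is an early signal.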

Q27: Attention head visualizations look noisy

What interviewer is probing:

Interpretability caution and proper conclusions.

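Direct answer: Treat attention maps as a diagnostic hint rather than an explanation: noisy or diffuse heads are common, and a high attention weight alone does not show that a token drove the output, so confirm any conclusion with an ablation or perturbation test before acting on it. State one production tradeoff, for example the extra evaluation cost against the risk of acting on a misleading picture.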

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Sample code or pseudocode (when relevant):

# Interview outline
1) Validate inputs and constraints
2) Apply core strategy
3) Add failure handling and observability hooks
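
Before drawing conclusions from a noisy heatmap, it can help to replace the visual impression with a number. The sketch below computes a normalized entropy per head; it is an assumption-laden illustration (the function names are hypothetical, and it assumes unmasked rows of equal length that each sum to 1). A high value only says a head is diffuse, not that it is unimportant.

```python
import math


def row_entropy(weights):
    """Shannon entropy in nats of one attention row."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def normalized_head_entropy(attention_rows):
    """Mean row entropy divided by log(n); 1.0 means uniform attention."""
    n = len(attention_rows[0])
    if n < 2:
        raise ValueError("need at least two key positions")
    mean = sum(row_entropy(row) for row in attention_rows) / len(attention_rows)
    return mean / math.log(n)


# Toy attention matrices for two heads (rows are query positions).
sharp_head = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]
diffuse_head = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]

# The diffuse head scores near 1.0, the sharp head near 0.12.
print(round(normalized_head_entropy(diffuse_head), 3))
print(round(normalized_head_entropy(sharp_head), 3))
```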

Q28: RAG context improves recall but hurts answer quality

What interviewer is probing:

Context noise and reranking decisions.

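Direct answer: Higher recall usually means more marginally relevant chunks in the prompt, so rerank the candidates, cap the context to the top few passages, and measure retrieval precision and answer faithfulness separately. State one production tradeoff, for example that a reranker adds latency and cost in exchange for answer quality.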

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Sample code or pseudocode (when relevant):

# Interview outline
1) Validate inputs and constraints
2) Apply core strategy
3) Add failure handling and observability hooks
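
The "rerank, then cap" idea can be sketched as a small selection function. The chunk fields (`text`, `score`, `tokens`), the thresholds, and `select_context` itself are hypothetical; in a real pipeline the score would come from a reranker and the token count from the model's tokenizer.

```python
def select_context(chunks, min_score=0.5, max_chunks=3, token_budget=200):
    """Keep the best-scoring chunks that clear a score floor and fit the budget."""
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        if chunk["score"] < min_score or len(selected) == max_chunks:
            break  # everything after this point is weaker or over the cap
        if used + chunk["tokens"] > token_budget:
            continue  # too large for the remaining budget; try a smaller chunk
        selected.append(chunk)
        used += chunk["tokens"]
    return selected


# Hypothetical reranker output.
chunks = [
    {"text": "A", "score": 0.91, "tokens": 80},
    {"text": "B", "score": 0.42, "tokens": 60},
    {"text": "C", "score": 0.77, "tokens": 150},
    {"text": "D", "score": 0.66, "tokens": 90},
]
print([c["text"] for c in select_context(chunks)])  # ['A', 'D']
```

Chunk C is relevant but does not fit next to A, and chunk B falls below the score floor, so recall drops slightly while the prompt carries less noise.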

Q29: Tokenizer upgrade broke backward compatibility

What interviewer is probing:

Migration strategy and regression controls.

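Direct answer: Pin the tokenizer version to the model artifact, diff token IDs between the old and new tokenizer on a golden corpus before rollout, and keep the old tokenizer deployable for rollback. State one production tradeoff, for example that slower, gated upgrades buy reliability.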

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Sample code or pseudocode (when relevant):

# Interview outline
1) Validate inputs and constraints
2) Apply core strategy
3) Add failure handling and observability hooks
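
A regression control for tokenizer upgrades can be as simple as diffing token IDs on a golden corpus. The sketch below uses toy whitespace tokenizers with hypothetical vocabularies as stand-ins for two tokenizer versions; in practice `old_encode` and `new_encode` would wrap the real tokenizers.

```python
def tokenizer_regressions(old_encode, new_encode, golden_texts):
    """Return the texts whose token IDs differ between two tokenizer versions."""
    diffs = []
    for text in golden_texts:
        old_ids, new_ids = old_encode(text), new_encode(text)
        if old_ids != new_ids:
            diffs.append({"text": text, "old": old_ids, "new": new_ids})
    return diffs


# Toy stand-ins for two tokenizer versions (hypothetical vocabularies).
OLD_VOCAB = {"<unk>": 0, "hello": 1, "world": 2}
NEW_VOCAB = {"<unk>": 0, "hello": 1, "world": 3}


def make_encoder(vocab):
    """Build a whitespace tokenizer that maps unknown words to <unk>."""
    return lambda text: [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]


report = tokenizer_regressions(
    make_encoder(OLD_VOCAB), make_encoder(NEW_VOCAB), ["hello", "hello world"]
)
print(len(report), report[0]["old"], report[0]["new"])  # 1 [1, 2] [1, 3]
```

Any non-empty report on a model that was not retrained with the new tokenizer is a candidate rollout blocker, because the model would receive IDs it was never trained on.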

Q30: Streaming output stalls under load

What interviewer is probing:

Backpressure and scheduler diagnosis.
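
The backpressure idea named above can be illustrated with a bounded per-stream buffer: when the client drains more slowly than the server produces, the producer is told to wait instead of growing an unbounded queue. This is a toy simulation under stated assumptions; `BoundedStreamBuffer`, `simulate`, and the tick model are hypothetical and do not describe any particular serving framework.

```python
from collections import deque


class BoundedStreamBuffer:
    """Per-stream token buffer that refuses new tokens when the client lags."""

    def __init__(self, max_pending):
        self.max_pending = max_pending
        self.pending = deque()
        self.rejected = 0  # how often the producer was told to wait

    def offer(self, token):
        """Return True if buffered, False if the producer must back off."""
        if len(self.pending) >= self.max_pending:
            self.rejected += 1
            return False
        self.pending.append(token)
        return True

    def drain(self, max_tokens):
        """Hand up to max_tokens buffered tokens to the client."""
        count = min(max_tokens, len(self.pending))
        return [self.pending.popleft() for _ in range(count)]


def simulate(ticks, drain_every, max_pending):
    """Producer offers one token per tick; client drains one every drain_every ticks."""
    buf = BoundedStreamBuffer(max_pending)
    next_token, delivered = 0, []
    for tick in range(1, ticks + 1):
        if buf.offer(next_token):
            next_token += 1  # advance only when the token was accepted
        if tick % drain_every == 0:
            delivered.extend(buf.drain(1))
    return delivered, buf.rejected, len(buf.pending)


delivered, rejected, backlog = simulate(ticks=20, drain_every=2, max_pending=4)
print(len(delivered), rejected, backlog)  # 10 7 3
```

Exporting the rejection count next to queue depth helps separate the two diagnoses: rejections with a full buffer point at slow consumers, while an empty buffer during a stall points back at the scheduler.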

Direct answer: Separate time to first token from inter-token latency, then check queue depth and scheduler behavior (batch admission, preemption, KV cache pressure) before touching the model; apply backpressure with bounded queues and admission control. State one production tradeoff, for example that rejecting or delaying some requests buys stable streaming latency for the rest.

Deep answer:

State assumptions, constraints, and success metric.
Explain the chosen design or algorithm and why alternatives are weaker.
Cover failure handling, observability, and rollback criteria.

Common mistakes and red flags:

Naming tools or algorithms without mapping them to constraints.
Ignoring edge cases, failure modes, or rollback triggers.
Skipping metrics needed to prove the design works in production.

Follow-up variants:

What changes if throughput doubles or latency budget is cut in half?
Which single metric would trigger rollback after deployment?

Sample code or pseudocode (when relevant):

# Interview outline
1) Validate inputs and constraints
2) Apply core strategy
3) Add failure handling and observability hooks

flowchart TD
    A[Tokenization or Output Bug] --> B{Mismatch in tokenizer config?}
    B -- Yes --> C[Align vocab, merges, and special tokens]
    B -- No --> D{Prompt template drift?}
    D -- Yes --> E[Diff and normalize templates]
    D -- No --> F{Truncation or max length issue?}
    F -- Yes --> G[Fix limits and monitor token histograms]
    F -- No --> H[Check model serving and cache path]

Figure: Tokenization and output debugging decision tree.

flowchart TD
    A[Need More Context] --> B{Knowledge static and bounded?}
    B -- Yes --> C[Use larger context window carefully]
    B -- No --> D[Use RAG pipeline]
    C --> E{Token budget acceptable?}
    E -- No --> D
    E -- Yes --> F[Add eval checks for truncation and noise]
    D --> G[Add retrieval precision and faithfulness evals]

Figure: Context-window versus RAG decision path.

Rapid-Fire Round

Why bigger context can reduce quality in practice.
KV cache role in decode speed.
Three ways to reduce token cost quickly.
One sign your masking is wrong.
Two causes of high p95 but normal p50.

Company Emphasis

Amazon:
- prioritize tradeoff articulation and operational impact.
- emphasize measurable rollout and rollback criteria.
Google:
- expect deeper theory follow-ups and derivation clarity.
- justify architecture choices with stronger first-principles reasoning.
Startup:
- prioritize pragmatic fixes, velocity, and cost discipline.
- show fast diagnosis and iterative improvement loops.
AI labs:
- emphasize frontier reasoning and architecture nuance.
- connect design choices to scaling behavior.

References

attention-and-transformer-internals.md
tokenization-context-window-and-cost.md
Attention Is All You Need: https://arxiv.org/abs/1706.03762
vLLM docs: https://docs.vllm.ai/en/latest/
OpenAI eval guidance: https://developers.openai.com/api/docs/guides/evals

Related Explainers

Use these explainers to reinforce weak interview topics.

Attention and Transformer Internals

Tokenization, Context Window, and Cost Engineering