LLM Learning Framework
1. How LLMs Work
Build a Large Language Model (From Scratch): by writing the code yourself step by step, gain a deep understanding of the internal workings of large language models (LLMs).

2. LLM Physical Networking
Alibaba HPN: A Data Center Network for Large Language Model Training
Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters
Optimized Network Architectures for Training Large Language Models With Billions of Parameters
RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems
3. LLM Inference Optimization
3.1 Key Performance Metrics
Time to First Token (TTFT): This measures the responsiveness of the system, defined as the time elapsed from when a user's request arrives to when the first output token is generated. It is primarily a function of the scheduling delay (time spent waiting in a queue) and the prefill phase latency. Low TTFT is essential for interactive applications like chatbots to feel responsive.
Time Per Output Token (TPOT) / Inter-Token Latency (ITL): This measures the speed of generation after the first token, defined as the average time to generate each subsequent token. It is determined by the decode phase latency. TPOT dictates the fluency of the output stream, which is critical for applications generating long responses.
Total Latency: This is the total time required to generate a complete response for a single user. It can be calculated as: Latency = TTFT + (TPOT × (number of output tokens − 1)).
Throughput: This measures the overall capacity and efficiency of the inference server. It is typically defined as the total number of output tokens generated per second across all concurrent users and requests. Maximizing throughput is the key to lowering the operational cost per token served. A small calculation sketch combining these metrics follows below.
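The following is a minimal sketch, not a benchmark harness, that turns the definitions above into numbers: total latency from TTFT and TPOT using the formula in this section, and aggregate output throughput across concurrent requests. The function names and the example values (250 ms TTFT, 20 ms TPOT, 512 output tokens, 64 concurrent requests) are illustrative assumptions, not measurements from any particular serving system.

```python
# Illustrative sketch: total latency and aggregate throughput from TTFT/TPOT.
# All names and numbers are hypothetical examples.

def total_latency(ttft_s: float, tpot_s: float, num_output_tokens: int) -> float:
    """Total latency = TTFT + TPOT * (number of output tokens - 1)."""
    return ttft_s + tpot_s * (num_output_tokens - 1)

def output_throughput(num_output_tokens: int, latency_s: float,
                      concurrent_requests: int = 1) -> float:
    """Output tokens per second across all concurrent requests, assuming each
    request produces num_output_tokens within latency_s seconds."""
    return concurrent_requests * num_output_tokens / latency_s

if __name__ == "__main__":
    ttft = 0.25      # 250 ms to first token (queueing + prefill)
    tpot = 0.02      # 20 ms per subsequent token (decode)
    n_tokens = 512   # output length per request

    lat = total_latency(ttft, tpot, n_tokens)   # 0.25 + 0.02 * 511 = 10.47 s
    tput = output_throughput(n_tokens, lat, concurrent_requests=64)
    print(f"total latency: {lat:.2f} s, aggregate throughput: {tput:.0f} tokens/s")
```

Note how the example makes the trade-off explicit: per-request latency is dominated by TPOT for long outputs, while server-level throughput scales with the number of requests decoded concurrently.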
3.2 LLM Inference Characteristics and Optimization
P/D-Serve: Serving Disaggregated Large Language Model at Scale
Characterizing Communication Patterns in Distributed Large Language Model Inference
4. LLM Collective Communication
Doubling all2all Performance with NVIDIA Collective Communication Library 2.12
5. LLM High Availability
Revisiting Reliability in Large-Scale Machine Learning Research Clusters