Lecture 2: From Language Models to Intelligent Agents

LLMs & Agents

Understanding how large language models work, and how they become autonomous agents capable of reasoning, planning, and taking actions.

Nir Naim
Tel Aviv University
Queueing Theory Seminar

What are Large Language Models?

A Large Language Model (LLM) is a neural network trained to understand and generate human language. Built on the Transformer architecture (covered in the previous lecture), LLMs learn patterns from vast amounts of text data.

Core Principle: Next-Token Prediction

At its heart, an LLM is trained to predict the next token given all previous tokens. This simple objective, when scaled to billions of parameters and trillions of tokens, produces remarkably capable models.

\[P(x_{t+1} | x_1, x_2, \ldots, x_t) = \text{softmax}(W_o \cdot \text{Transformer}(x_1, \ldots, x_t))\]
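To make this concrete, here is a minimal NumPy sketch of the final step: projecting a hidden state to logits and applying softmax. The vocabulary, weights, and hidden state are random stand-ins, not a trained model:

Python
import numpy as np

# Toy stand-ins: 5-token vocabulary, hidden size 8, random "trained" weights.
rng = np.random.default_rng(0)
vocab = ["Paris", "Lyon", "Marseille", "Nice", "Bordeaux"]
W_o = rng.normal(size=(len(vocab), 8))  # output projection W_o
h = rng.normal(size=8)                  # Transformer output for position t

logits = W_o @ h                        # unnormalized scores
probs = np.exp(logits - logits.max())   # softmax, shifted for stability
probs /= probs.sum()

# P(x_{t+1} | x_1, ..., x_t) over the toy vocabulary
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.1%}")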

Decoder-Only Architecture

Modern LLMs like GPT-4, Claude, and Llama use a decoder-only architecture. Unlike the original encoder-decoder Transformer, they process input and generate output in a single unified model using causal (masked) attention.

Key Characteristics

  • Autoregressive: Generates one token at a time, feeding each output back as input (see the sketch after this list)
  • Causal masking: Each position can only attend to previous positions
  • Unified representation: Same model handles both "understanding" and "generation"
  • Context window: Fixed maximum sequence length (e.g., 8K, 128K, 1M+ tokens)
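A minimal sketch of the autoregressive loop, assuming a hypothetical model(tokens) that returns next-token logits, with greedy decoding for simplicity:

Python
import numpy as np

def generate(model, prompt_tokens, max_new_tokens, eos_id):
    """Greedy autoregressive decoding: each new token is appended
    to the context and fed back in for the next prediction."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)            # hypothetical: next-token scores
        next_id = int(np.argmax(logits))  # greedy: take the most likely token
        tokens.append(next_id)            # feed the output back as input
        if next_id == eos_id:             # stop at end-of-sequence
            break
    return tokens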

Scale Matters

The "Large" in LLM refers to both model size and training data. GPT-3 has 175 billion parameters trained on ~500 billion tokens. GPT-4 is estimated to be even larger.

💡 Key Insight

Scaling laws show that model performance improves predictably with compute, data, and parameters. This empirical finding drove the race to build ever-larger models.

Training LLMs

Training an LLM is a multi-stage process, from self-supervised pretraining on web-scale data, through supervised fine-tuning, to alignment with human preferences.

Stage 1: Pretraining

The model learns language patterns by predicting the next token on massive text corpora:

\[\mathcal{L}_{\text{pretrain}} = -\sum_{t=1}^{T} \log P(x_t | x_1, \ldots, x_{t-1}; \theta)\]
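In code, this is ordinary cross-entropy over shifted targets. A PyTorch sketch with toy shapes (random logits standing in for model outputs):

Python
import torch
import torch.nn.functional as F

# Toy shapes: T = 6 positions, vocabulary of V = 100 tokens.
T, V = 6, 100
logits = torch.randn(T, V)             # stand-in for model outputs
targets = torch.randint(0, V, (T,))    # the actual next tokens x_t

# Cross-entropy = -(1/T) * sum_t log P(x_t | x_1, ..., x_{t-1})
loss = F.cross_entropy(logits, targets)
print(loss.item())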

Training Data Sources

  • Common Crawl: Petabytes of web pages
  • Books: Literature, textbooks, technical manuals
  • Wikipedia: Encyclopedic knowledge
  • Code: GitHub repositories, documentation
  • Scientific papers: arXiv, PubMed, etc.

Stage 2: Supervised Fine-Tuning (SFT)

After pretraining, models are fine-tuned on curated datasets of high-quality examples demonstrating desired behaviors:

Example
# Instruction-following example
{
  "instruction": "Explain quantum entanglement simply",
  "response": "Quantum entanglement is when two particles become connected in such a way that measuring one instantly affects the other, no matter how far apart they are..."
}

Stage 3: Reinforcement Learning from Human Feedback (RLHF)

To align model outputs with human preferences, RLHF trains a reward model on human comparisons, then optimizes the LLM to maximize the reward while staying close to the original (reference) model:

\[\mathcal{J}_{\text{RLHF}} = \mathbb{E}_{x \sim D, y \sim \pi_\theta}[R(x, y)] - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\]
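As a back-of-the-envelope sketch, here is the per-response objective with made-up numbers, using the common single-sample approximation of the KL term as the policy/reference log-probability gap:

Python
# One prompt x with one sampled response y (all numbers made up):
reward = 0.83         # R(x, y) from the trained reward model
logp_policy = -12.4   # log pi_theta(y | x), the model being trained
logp_ref = -10.1      # log pi_ref(y | x), the frozen pre-RLHF model
beta = 0.02           # strength of the KL penalty

# KL-regularized objective: reward, minus a penalty for drifting
# away from the reference model
objective = reward - beta * (logp_policy - logp_ref)
print(objective)  # maximized in expectation, e.g. with PPO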

💡 Why RLHF?

RLHF helps models be helpful, harmless, and honest. It teaches models to refuse harmful requests, admit uncertainty, and follow complex instructions.

Inference: Generating Text

At inference time, the model generates text token-by-token. Several parameters control this process:

Temperature

Temperature controls the "randomness" of generation by scaling the logits before softmax:

\[P(x_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}\]

Where \(T\) is temperature. Low temperature (T → 0) makes output deterministic; high temperature (T → ∞) makes it uniform random.
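A minimal sketch of temperature scaling over toy logits:

Python
import numpy as np

def softmax_with_temperature(logits, T):
    """Divide logits by T before softmax: low T sharpens, high T flattens."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()               # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [5.0, 3.2, 2.8, 2.3, 1.8]  # toy scores for five candidate tokens
for T in (0.1, 1.0, 10.0):
    print(T, np.round(softmax_with_temperature(logits, T), 3))
# T = 0.1 is nearly one-hot; T = 10 is nearly uniform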

Interactive: Temperature Effect

[Interactive widget: a temperature slider from deterministic to random for the prompt "The capital of France is ___". At T = 1.0 the token probabilities are roughly: Paris 72%, Lyon 12%, Marseille 8%, Nice 5%, Bordeaux 3%.]

Sampling Strategies

Common Methods

  • Greedy: Always pick the highest probability token
  • Top-k: Sample from the k most likely tokens
  • Top-p (nucleus): Sample from the smallest set of tokens whose cumulative probability exceeds p (see the sketch after this list)
  • Beam search: Maintain multiple candidate sequences
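Minimal sketches of top-k and nucleus sampling over a toy distribution (the probabilities echo the France example above):

Python
import numpy as np

def top_k_sample(probs, k, rng):
    """Keep the k most likely tokens, renormalize, then sample."""
    idx = np.argsort(probs)[::-1][:k]
    p = probs[idx] / probs[idx].sum()
    return int(rng.choice(idx, p=p))

def top_p_sample(probs, p_threshold, rng):
    """Nucleus sampling: keep the smallest set of tokens whose
    cumulative probability exceeds p_threshold, then sample."""
    idx = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[idx]), p_threshold)) + 1
    keep = idx[:cutoff]
    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))

rng = np.random.default_rng(0)
probs = np.array([0.72, 0.12, 0.08, 0.05, 0.03])  # Paris, Lyon, ...
print(top_k_sample(probs, k=3, rng=rng), top_p_sample(probs, 0.9, rng=rng))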

Context Window

The context window is the maximum number of tokens the model can process. Attention is O(n²) in sequence length, so longer contexts are expensive. Modern techniques like sliding window attention, sparse attention, and RoPE scaling extend context lengths.
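As a sketch of one of these ideas, a sliding-window causal mask restricts each position to the previous w positions:

Python
import numpy as np

def sliding_window_mask(n, w):
    """True where attention is allowed: causal, and within w positions.
    Each row then has at most w ones, so cost drops from O(n^2) to O(n*w)."""
    i = np.arange(n)[:, None]  # query positions
    j = np.arange(n)[None, :]  # key positions
    return (j <= i) & (j > i - w)

print(sliding_window_mask(6, 3).astype(int))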

Prompting Techniques

Prompting is the art of instructing LLMs through natural language. Different techniques unlock different capabilities.

Classify the sentiment of this review as positive, negative, or neutral:

"The food was amazing but the service was slow."

Sentiment:

Zero-shot: No examples provided. The model relies entirely on its pretraining knowledge.

Classify the sentiment of reviews:

Review: "Best purchase I ever made!"
Sentiment: positive

Review: "Terrible quality, broke after one day."
Sentiment: negative

Review: "The food was amazing but the service was slow."
Sentiment:

Few-shot: Provide examples that demonstrate the pattern. The model learns from context.

Classify the sentiment of this review. Think step by step:

Review: "The food was amazing but the service was slow."

Let me analyze this:
1. "food was amazing" - this is positive
2. "service was slow" - this is negative
3. Mixed signals, but "amazing" is strong positive
4. Overall leaning positive with a caveat

Sentiment: positive (mixed)

Chain-of-Thought: Encourage the model to reason step-by-step before answering.

Why Chain-of-Thought Works

CoT prompting improves performance on reasoning tasks by:

  • Decomposing complex problems into smaller, checkable steps
  • Spending more tokens, and therefore more computation, on harder problems
  • Exposing intermediate reasoning, so errors are visible rather than hidden in a single guess

💡 Emergent Ability

Chain-of-thought reasoning is an emergent ability: it only works reliably in sufficiently large models. Smaller models may produce incoherent chains.

Scaling Laws

Empirical research has revealed predictable relationships between model performance and three key factors:

\[L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}\]

Where \(N\) = parameters, \(D\) = dataset size, and \(C\) = compute; when the other factors are not the bottleneck, each follows its own power law. For parameters, \(\alpha_N \approx 0.076\), so doubling \(N\) multiplies the loss by \(2^{-0.076} \approx 0.95\), roughly a 5% reduction.

Key Findings (Kaplan et al., 2020)

  • Performance scales as a power law with compute, data, and parameters
  • Larger models are more sample efficient
  • Optimal allocation: scale model size faster than dataset size
  • No signs of diminishing returns at current scales

Emergent Abilities

Some capabilities appear suddenly at certain scales rather than improving smoothly. Examples include:

  • Multi-step arithmetic
  • Chain-of-thought reasoning
  • Following complex, multi-part instructions

Below a scale threshold these tasks sit near chance; past it, performance jumps sharply.

What are Agents?

An LLM Agent is a system that uses a language model as its core reasoning engine, combined with the ability to take actions, use tools, and maintain memory across interactions.

Definition: LLM Agent

An LLM Agent = LLM (reasoning) + Tools (actions) + Memory (state) + Loop (orchestration)

[Diagram: the LLM reasoning engine at the center; user input flows in and responses flow out, while the engine is wired in a loop to tools (search, code, API) and memory/context.]

The Agent Loop

Agents operate in a continuous loop (sketched in code after the list):

  1. Perceive: Receive input from user or environment
  2. Think: LLM reasons about what to do next
  3. Act: Execute a tool or generate a response
  4. Observe: Process the result of the action
  5. Repeat: Until task is complete
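Here is the sketch, with hypothetical llm.decide(...) and run_tool(...) helpers standing in for a real model client and tool dispatcher:

Python
def agent_loop(llm, tools, user_input, max_steps=10):
    """Perceive -> think -> act -> observe, until the LLM finishes."""
    history = [{"role": "user", "content": user_input}]        # 1. perceive
    for _ in range(max_steps):
        decision = llm.decide(history, tools)                  # 2. think
        if decision["type"] == "final_answer":
            return decision["content"]                         # done: respond
        result = run_tool(tools, decision["name"],
                          decision["arguments"])               # 3. act
        history.append({"role": "tool", "content": result})    # 4. observe
    return "Step limit reached without a final answer."        # safety valve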

💡 Agents vs. Chatbots

A chatbot generates text responses. An agent can take actions in the world: search the web, execute code, call APIs, modify files, and more.

Tool Use & Function Calling

Modern LLMs can be taught to use external tools through function calling. The model outputs structured requests that are executed by external code.

How Function Calling Works

Python
# Define available tools
tools = [
    {
        "name": "search_web",
        "description": "Search the web for information",
        "parameters": {
            "query": {"type": "string", "description": "Search query"}
        }
    },
    {
        "name": "calculate",
        "description": "Perform mathematical calculations",
        "parameters": {
            "expression": {"type": "string", "description": "Math expression"}
        }
    }
]

# LLM decides to call a tool
response = llm.chat(
    messages=[{"role": "user", "content": "What's 15% of 847?"}],
    tools=tools
)

# Output: {"name": "calculate", "arguments": {"expression": "847 * 0.15"}}

Common Tool Types

  • 🔍 Web Search: Query search engines for real-time information
  • 🧮 Calculator: Precise mathematical computations
  • 💻 Code Execution: Run Python/JavaScript in sandboxed environments
  • 📁 File Operations: Read, write, and manipulate files
  • 🌐 API Calls: Interact with external services
  • 🗄️ Database: Query and update databases

Structured Outputs

Beyond tool calls, LLMs can output any structured format (JSON, XML, etc.). This enables reliable parsing and integration with downstream systems.

Python (Pydantic)
from typing import Literal

from pydantic import BaseModel

class MovieReview(BaseModel):
    title: str
    rating: float
    sentiment: Literal["positive", "negative", "neutral"]
    summary: str

# The LLM's JSON output is parsed and validated against this schema
# (llm.generate is an illustrative API, not a specific library call)
review = llm.generate(MovieReview, prompt="Review: The Matrix...")

Multi-Agent Systems

Complex tasks can be decomposed across multiple specialized agents that collaborate, debate, or supervise each other.

Common Patterns

Hierarchical

A "manager" agent delegates subtasks to specialist agents and synthesizes results.

Collaborative

Peer agents work together, sharing information and building on each other's outputs.

Adversarial / Debate

Agents argue different positions; a judge synthesizes the best answer.

Example: Research Team

For instance, a hierarchical research team pairs a lead agent, which plans the investigation and synthesizes the final report, with specialist agents for web search, analysis, and writing.
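A minimal sketch of the hierarchical pattern, with a stubbed ask(role, task) helper standing in for real LLM calls:

Python
def ask(role_prompt: str, task: str) -> str:
    """Stubbed LLM call with a role-specific system prompt, so the
    control flow below runs end to end without a real model."""
    return f"[{role_prompt.split('.')[0]}] response to: {task[:40]}"

def research_team(question: str) -> str:
    # The lead agent decomposes the question into subtasks
    plan = ask("You are a research lead. List search subtasks.", question)
    # Specialist agents work each subtask independently
    findings = [ask("You are a web researcher. Answer with sources.", line)
                for line in plan.splitlines()]
    # The lead agent synthesizes the specialists' outputs
    return ask("You are a research lead. Synthesize a report.",
               question + "\n" + "\n".join(findings))

print(research_team("What drove recent progress in open-weight LLMs?"))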

Agent Frameworks

Several frameworks simplify building LLM agents. Each has different design philosophies and trade-offs.

| Framework  | Focus           | Key Feature                        |
|------------|-----------------|------------------------------------|
| LangChain  | General purpose | Extensive integrations, chains     |
| PydanticAI | Type safety     | Pydantic-based structured outputs  |
| CrewAI     | Multi-agent     | Role-based agent teams             |
| AutoGen    | Conversations   | Multi-agent chat orchestration     |
| LlamaIndex | RAG / Data      | Document indexing & retrieval      |

Why PydanticAI?

For our workshop, we'll use PydanticAI because it is lightweight, model-agnostic, and built on the same Pydantic models we already use for structured outputs, so tool arguments and results are type-checked:

Python
from pydantic_ai import Agent

agent = Agent(
    'google-gla:gemini-2.5-flash',  # Gemini free tier
    system_prompt="You are a helpful research assistant."
)

@agent.tool_plain  # tool_plain: the function takes no RunContext argument
def search_web(query: str) -> str:
    """Search the web for information."""
    return perform_search(query)  # perform_search: your own search helper

result = agent.run_sync("What's the latest news on AI?")
print(result.output)

Ready to Build Your Own Agent?

Join the hands-on workshop where we'll build a research assistant agent step-by-step using PydanticAI.

Further Reading

Papers

  • Kaplan et al. (2020), "Scaling Laws for Neural Language Models." arXiv:2001.08361