Lecture 2: From Language Models to Intelligent Agents

LLMs & Agents

Understanding how large language models work, and how they become autonomous agents capable of reasoning, planning, and taking actions.

Nir Naim
Tel Aviv University
Queueing Theory Seminar

What are Large Language Models?

A Large Language Model (LLM) is a neural network trained to understand and generate human language. Built on the Transformer architecture (covered in the previous lecture), LLMs learn patterns from vast amounts of text data.

Core Principle: Next-Token Prediction

At its heart, an LLM is trained to predict the next token given all previous tokens. This simple objective, when scaled to billions of parameters and trillions of tokens, produces remarkably capable models.

\[P(x_{t+1} | x_1, x_2, \ldots, x_t) = \text{softmax}(W_o \cdot \text{Transformer}(x_1, \ldots, x_t))\]
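To make this concrete, here is a minimal NumPy sketch of the final step: projecting a hidden state to logits and applying softmax. The vocabulary, weights, and hidden state are random stand-ins, not a trained model:

Python
import numpy as np

# Toy stand-ins: 5-token vocabulary, hidden size 8, random "trained" weights.
rng = np.random.default_rng(0)
vocab = ["Paris", "Lyon", "Marseille", "Nice", "Bordeaux"]
W_o = rng.normal(size=(len(vocab), 8))  # output projection W_o
h = rng.normal(size=8)                  # Transformer output for position t

logits = W_o @ h                        # unnormalized scores
probs = np.exp(logits - logits.max())   # softmax, shifted for stability
probs /= probs.sum()

# P(x_{t+1} | x_1, ..., x_t) over the toy vocabulary
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.1%}")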

Decoder-Only Architecture

Modern LLMs like GPT-4, Claude, and Llama use a decoder-only architecture. Unlike the original encoder-decoder Transformer, they process input and generate output in a single unified model using causal (masked) attention.

Key Characteristics

  • Autoregressive: Generates one token at a time, feeding each output back as input (see the sketch after this list)
  • Causal masking: Each position can only attend to previous positions
  • Unified representation: Same model handles both "understanding" and "generation"
  • Context window: Fixed maximum sequence length (e.g., 8K, 128K, 1M+ tokens)
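A minimal sketch of the autoregressive loop, assuming a hypothetical model(tokens) that returns next-token logits, with greedy decoding for simplicity:

Python
import numpy as np

def generate(model, prompt_tokens, max_new_tokens, eos_id):
    """Greedy autoregressive decoding: each new token is appended
    to the context and fed back in for the next prediction."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)            # hypothetical: next-token scores
        next_id = int(np.argmax(logits))  # greedy: take the most likely token
        tokens.append(next_id)            # feed the output back as input
        if next_id == eos_id:             # stop at end-of-sequence
            break
    return tokens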

Scale Matters

The "Large" in LLM refers to both model size and training data. GPT-3 has 175 billion parameters trained on ~500 billion tokens. GPT-4 is estimated to be even larger.

💡 Key Insight

Scaling laws show that model performance improves predictably with compute, data, and parameters. This empirical finding drove the race to build ever-larger models.

Training LLMs

Training an LLM is a multi-stage process, from self-supervised pretraining on web-scale data, through supervised fine-tuning, to alignment with human preferences.

Stage 1: Pretraining

The model learns language patterns by predicting the next token on massive text corpora:

\[\mathcal{L}_{\text{pretrain}} = -\sum_{t=1}^{T} \log P(x_t | x_1, \ldots, x_{t-1}; \theta)\]
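In code, this is ordinary cross-entropy over shifted targets. A PyTorch sketch with toy shapes (random logits standing in for model outputs):

Python
import torch
import torch.nn.functional as F

# Toy shapes: T = 6 positions, vocabulary of V = 100 tokens.
T, V = 6, 100
logits = torch.randn(T, V)             # stand-in for model outputs
targets = torch.randint(0, V, (T,))    # the actual next tokens x_t

# Cross-entropy = -(1/T) * sum_t log P(x_t | x_1, ..., x_{t-1})
loss = F.cross_entropy(logits, targets)
print(loss.item())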

Training Data Sources

  • Common Crawl: Petabytes of web pages
  • Books: Literature, textbooks, technical manuals
  • Wikipedia: Encyclopedic knowledge
  • Code: GitHub repositories, documentation
  • Scientific papers: arXiv, PubMed, etc.

Stage 2: Supervised Fine-Tuning (SFT)

After pretraining, models are fine-tuned on curated datasets of high-quality examples demonstrating desired behaviors:

Example
# Instruction-following example
{
  "instruction": "Explain quantum entanglement simply",
  "response": "Quantum entanglement is when two particles become connected in such a way that measuring one instantly affects the other, no matter how far apart they are..."
}

Stage 3: Reinforcement Learning from Human Feedback (RLHF)

To align model outputs with human preferences, RLHF trains a reward model on human comparisons, then optimizes the LLM to maximize the reward while staying close to the original (reference) model:

\[\mathcal{J}_{\text{RLHF}} = \mathbb{E}_{x \sim D, y \sim \pi_\theta}[R(x, y)] - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\]
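As a back-of-the-envelope sketch, here is the per-response objective with made-up numbers, using the common single-sample approximation of the KL term as the policy/reference log-probability gap:

Python
# One prompt x with one sampled response y (all numbers made up):
reward = 0.83         # R(x, y) from the trained reward model
logp_policy = -12.4   # log pi_theta(y | x), the model being trained
logp_ref = -10.1      # log pi_ref(y | x), the frozen pre-RLHF model
beta = 0.02           # strength of the KL penalty

# KL-regularized objective: reward, minus a penalty for drifting
# away from the reference model
objective = reward - beta * (logp_policy - logp_ref)
print(objective)  # maximized in expectation, e.g. with PPO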

💡 Why RLHF?

RLHF helps models be helpful, harmless, and honest. It teaches models to refuse harmful requests, admit uncertainty, and follow complex instructions.

Inference: Generating Text

At inference time, the model generates text token-by-token. Several parameters control this process:

Temperature

Temperature controls the "randomness" of generation by scaling the logits before softmax:

\[P(x_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}\]

Where \(T\) is temperature. Low temperature (T → 0) makes output deterministic; high temperature (T → ∞) makes it uniform random.
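A minimal sketch of temperature scaling over toy logits:

Python
import numpy as np

def softmax_with_temperature(logits, T):
    """Divide logits by T before softmax: low T sharpens, high T flattens."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()               # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [5.0, 3.2, 2.8, 2.3, 1.8]  # toy scores for five candidate tokens
for T in (0.1, 1.0, 10.0):
    print(T, np.round(softmax_with_temperature(logits, T), 3))
# T = 0.1 is nearly one-hot; T = 10 is nearly uniform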

Interactive: Temperature Effect

[Interactive widget: a temperature slider from deterministic to random for the prompt "The capital of France is ___". At T = 1.0 the token probabilities are roughly: Paris 72%, Lyon 12%, Marseille 8%, Nice 5%, Bordeaux 3%.]

Sampling Strategies

Common Methods

  • Greedy: Always pick the highest probability token
  • Top-k: Sample from the k most likely tokens
  • Top-p (nucleus): Sample from the smallest set of tokens whose cumulative probability exceeds p (see the sketch after this list)
  • Beam search: Maintain multiple candidate sequences
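Minimal sketches of top-k and nucleus sampling over a toy distribution (the probabilities echo the France example above):

Python
import numpy as np

def top_k_sample(probs, k, rng):
    """Keep the k most likely tokens, renormalize, then sample."""
    idx = np.argsort(probs)[::-1][:k]
    p = probs[idx] / probs[idx].sum()
    return int(rng.choice(idx, p=p))

def top_p_sample(probs, p_threshold, rng):
    """Nucleus sampling: keep the smallest set of tokens whose
    cumulative probability exceeds p_threshold, then sample."""
    idx = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[idx]), p_threshold)) + 1
    keep = idx[:cutoff]
    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))

rng = np.random.default_rng(0)
probs = np.array([0.72, 0.12, 0.08, 0.05, 0.03])  # Paris, Lyon, ...
print(top_k_sample(probs, k=3, rng=rng), top_p_sample(probs, 0.9, rng=rng))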

Context Window

The context window is the maximum number of tokens the model can process. Attention is O(n²) in sequence length, so longer contexts are expensive. Modern techniques like sliding window attention, sparse attention, and RoPE scaling extend context lengths.
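As a sketch of one of these ideas, a sliding-window causal mask restricts each position to the previous w positions:

Python
import numpy as np

def sliding_window_mask(n, w):
    """True where attention is allowed: causal, and within w positions.
    Each row then has at most w ones, so cost drops from O(n^2) to O(n*w)."""
    i = np.arange(n)[:, None]  # query positions
    j = np.arange(n)[None, :]  # key positions
    return (j <= i) & (j > i - w)

print(sliding_window_mask(6, 3).astype(int))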

Prompting Techniques

Prompting is the art of instructing LLMs through natural language. Different techniques unlock different capabilities.

Classify the sentiment of this review as positive, negative, or neutral:

"The food was amazing but the service was slow."

Sentiment:

Zero-shot: No examples provided. The model relies entirely on its pretraining knowledge.

Classify the sentiment of reviews:

Review: "Best purchase I ever made!"
Sentiment: positive

Review: "Terrible quality, broke after one day."
Sentiment: negative

Review: "The food was amazing but the service was slow."
Sentiment:

Few-shot: Provide examples that demonstrate the pattern. The model learns from context.

Classify the sentiment of this review. Think step by step:

Review: "The food was amazing but the service was slow."

Let me analyze this:
1. "food was amazing" - this is positive
2. "service was slow" - this is negative
3. Mixed signals, but "amazing" is strong positive
4. Overall leaning positive with a caveat

Sentiment: positive (mixed)

Chain-of-Thought: Encourage the model to reason step-by-step before answering.

Why Chain-of-Thought Works

CoT prompting improves performance on reasoning tasks by:

  • Decomposing complex problems into smaller, checkable steps
  • Spending more tokens, and therefore more computation, on harder problems
  • Exposing intermediate reasoning, so errors are visible rather than hidden in a single guess

💡 Emergent Ability

Chain-of-thought reasoning is an emergent ability: it only works reliably in sufficiently large models. Smaller models may produce incoherent chains.

Scaling Laws

Empirical research has revealed predictable relationships between model performance and three key factors:

\[L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}\]

Where \(N\) = parameters, \(D\) = dataset size, and \(C\) = compute; when the other factors are not the bottleneck, each follows its own power law. For parameters, \(\alpha_N \approx 0.076\), so doubling \(N\) multiplies the loss by \(2^{-0.076} \approx 0.95\), roughly a 5% reduction.

Key Findings (Kaplan et al., 2020)

  • Performance scales as a power law with compute, data, and parameters
  • Larger models are more sample efficient
  • Optimal allocation: scale model size faster than dataset size
  • No signs of diminishing returns at current scales

Emergent Abilities

Some capabilities appear suddenly at certain scales rather than improving smoothly. Examples include:

  • Multi-step arithmetic
  • Chain-of-thought reasoning
  • Following complex, multi-part instructions

Below a scale threshold these tasks sit near chance; past it, performance jumps sharply.

What are Agents?

An LLM Agent is a system that uses a language model as its core reasoning engine, combined with the ability to take actions, use tools, and maintain memory across interactions.

Definition: LLM Agent

An LLM Agent = LLM (reasoning) + Tools (actions) + Memory (state) + Loop (orchestration)

[Diagram: the LLM reasoning engine at the center; user input flows in and responses flow out, while the engine is wired in a loop to tools (search, code, API) and memory/context.]

The Agent Loop

Agents operate in a continuous loop (sketched in code after the list):

  1. Perceive: Receive input from user or environment
  2. Think: LLM reasons about what to do next
  3. Act: Execute a tool or generate a response
  4. Observe: Process the result of the action
  5. Repeat: Until task is complete
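Here is the sketch, with hypothetical llm.decide(...) and run_tool(...) helpers standing in for a real model client and tool dispatcher:

Python
def agent_loop(llm, tools, user_input, max_steps=10):
    """Perceive -> think -> act -> observe, until the LLM finishes."""
    history = [{"role": "user", "content": user_input}]        # 1. perceive
    for _ in range(max_steps):
        decision = llm.decide(history, tools)                  # 2. think
        if decision["type"] == "final_answer":
            return decision["content"]                         # done: respond
        result = run_tool(tools, decision["name"],
                          decision["arguments"])               # 3. act
        history.append({"role": "tool", "content": result})    # 4. observe
    return "Step limit reached without a final answer."        # safety valve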

💡 Agents vs. Chatbots

A chatbot generates text responses. An agent can take actions in the world: search the web, execute code, call APIs, modify files, and more.

Tool Use & Function Calling

Modern LLMs can be taught to use external tools through function calling. The model outputs structured requests that are executed by external code.

How Function Calling Works

Python
# Define available tools
tools = [
    {
        "name": "search_web",
        "description": "Search the web for information",
        "parameters": {
            "query": {"type": "string", "description": "Search query"}
        }
    },
    {
        "name": "calculate",
        "description": "Perform mathematical calculations",
        "parameters": {
            "expression": {"type": "string", "description": "Math expression"}
        }
    }
]

# LLM decides to call a tool
response = llm.chat(
    messages=[{"role": "user", "content": "What's 15% of 847?"}],
    tools=tools
)

# Output: {"name": "calculate", "arguments": {"expression": "847 * 0.15"}}

Common Tool Types

  • 🔍 Web Search: Query search engines for real-time information
  • 🧮 Calculator: Precise mathematical computations
  • 💻 Code Execution: Run Python/JavaScript in sandboxed environments
  • 📁 File Operations: Read, write, and manipulate files
  • 🌐 API Calls: Interact with external services
  • 🗄️ Database: Query and update databases

Structured Outputs

Beyond tool calls, LLMs can output any structured format (JSON, XML, etc.). This enables reliable parsing and integration with downstream systems.

Python (Pydantic)
from typing import Literal

from pydantic import BaseModel

class MovieReview(BaseModel):
    title: str
    rating: float
    sentiment: Literal["positive", "negative", "neutral"]
    summary: str

# The LLM's JSON output is parsed and validated against this schema
# (llm.generate is an illustrative API, not a specific library call)
review = llm.generate(MovieReview, prompt="Review: The Matrix...")

Multi-Agent Systems

Complex tasks can be decomposed across multiple specialized agents that collaborate, debate, or supervise each other.

Common Patterns

Hierarchical

A "manager" agent delegates subtasks to specialist agents and synthesizes results.

Collaborative

Peer agents work together, sharing information and building on each other's outputs.

Adversarial / Debate

Agents argue different positions; a judge synthesizes the best answer.

Example: Research Team

For instance, a hierarchical research team pairs a lead agent, which plans the investigation and synthesizes the final report, with specialist agents for web search, analysis, and writing.
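A minimal sketch of the hierarchical pattern, with a stubbed ask(role, task) helper standing in for real LLM calls:

Python
def ask(role_prompt: str, task: str) -> str:
    """Stubbed LLM call with a role-specific system prompt, so the
    control flow below runs end to end without a real model."""
    return f"[{role_prompt.split('.')[0]}] response to: {task[:40]}"

def research_team(question: str) -> str:
    # The lead agent decomposes the question into subtasks
    plan = ask("You are a research lead. List search subtasks.", question)
    # Specialist agents work each subtask independently
    findings = [ask("You are a web researcher. Answer with sources.", line)
                for line in plan.splitlines()]
    # The lead agent synthesizes the specialists' outputs
    return ask("You are a research lead. Synthesize a report.",
               question + "\n" + "\n".join(findings))

print(research_team("What drove recent progress in open-weight LLMs?"))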

Agent Frameworks

Several frameworks simplify building LLM agents. Each has different design philosophies and trade-offs.

| Framework  | Focus           | Key Feature                        |
|------------|-----------------|------------------------------------|
| LangChain  | General purpose | Extensive integrations, chains     |
| PydanticAI | Type safety     | Pydantic-based structured outputs  |
| CrewAI     | Multi-agent     | Role-based agent teams             |
| AutoGen    | Conversations   | Multi-agent chat orchestration     |
| LlamaIndex | RAG / Data      | Document indexing & retrieval      |

Why PydanticAI?

For our workshop, we'll use PydanticAI because it is lightweight, model-agnostic, and built on the same Pydantic models we already use for structured outputs, so tool arguments and results are type-checked:

Python
from pydantic_ai import Agent

agent = Agent(
    'google-gla:gemini-2.5-flash',  # Gemini free tier
    system_prompt="You are a helpful research assistant."
)

@agent.tool_plain  # tool_plain: the function takes no RunContext argument
def search_web(query: str) -> str:
    """Search the web for information."""
    return perform_search(query)  # perform_search: your own search helper

result = agent.run_sync("What's the latest news on AI?")
print(result.output)

Ready to Build Your Own Agent?

Join the hands-on workshop where we'll build a research assistant agent step-by-step using PydanticAI.

Further Reading

Papers

  • Kaplan et al. (2020), "Scaling Laws for Neural Language Models." arXiv:2001.08361