Programming AI, not prompting it

January 29, 2026 · 6 min read

AI · LLMs · DSPy · LangChain · Engineering

I spent an embarrassing amount of time last year tweaking prompt strings. Not five minutes here and there; hours. "Maybe if I add one more adjective," I'd think. "Perhaps reordering these instructions will help." I'd test with a few examples, notice a 2% improvement, and celebrate like I'd solved climate change. It was inefficient and brittle. It was the opposite of how I actually build software, or how I like to build it.

I discovered DSPy, and it felt like someone finally said the quiet part out loud: we've been doing this backward.

Here's what drives me crazy about "prompt engineering": it's not engineering. It's configuration. You take a text file, add emojis, paste in examples, adjust the temperature, and hope. When something breaks, you crack open that same text file and hope harder. Your "AI system" is now a collection of string templates with implicit dependencies. Good luck maintaining that in production. You optimize prompts for one model, then Claude 4.5 comes out and half your work evaporates. You need a new model for faster inference? Rewrite everything. You want to compose multiple steps together? Now you're playing Jenga with interdependent prompts.

DSPy stops this madness. It says: stop tweaking strings, start writing code.

What DSPy Actually Does

DSPy is a framework for programming language models instead of prompting them. That distinction matters more than it sounds.

In traditional development, you'd write: result = function(input). The function is deterministic, repeatable, testable. With DSPy, you write code that looks exactly like that, but the function itself is powered by an LLM. More importantly, you define what the function should do (a signature), what technique to use (a module), and then let the framework optimize it automatically.

Think of it like this: you describe the problem structure, compose modules to solve it, and then the system tunes the prompts and examples for you. No manual tweaking required. It's declarative: you declare what you want, not how to prompt for it.
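
Concretely, a DSPy "function" is a module you call like any other. Here's a minimal sketch, assuming the current dspy.LM wrapper and the inline string-signature shorthand; the model string and field names are just placeholders:

import dspy

# Configure an LM once (the model string is a placeholder),
# then call an LLM-backed module like ordinary code.
dspy.configure(lm=dspy.LM("anthropic/claude-opus-4-5-20251101"))

# "document -> summary" is an inline signature: input field -> output field
summarize = dspy.ChainOfThought("document -> summary")

result = summarize(document="Raw research feedback goes here...")
print(result.summary)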

The framework has three core pieces:

  1. Signatures: A declaration of inputs and outputs. "Given research feedback, extract three key insights."
  2. Modules: Techniques for solving problems. Chain-of-thought, ReAct, multi-hop reasoning—all composable.
  3. Optimizers: Algorithms that automatically improve your modules by adjusting instructions, generating few-shot examples, or even fine-tuning weights.

A practical example: Building a multi-stage analysis pipeline

Let me show you something concrete. Imagine you're building a system that processes customer research feedback and extracts actionable insights. This is something I hadn't done before, and it's a perfect DSPy use case because it naturally breaks into multiple stages.

In a traditional approach, you'd write a monster prompt that tries to do everything at once. In DSPy, you build modules, compose them, and let the optimizer handle the rest.

import dspy
from dspy import ChainOfThought

# Set up your LM (Claude, GPT-4, whatever)
lm = dspy.LM("anthropic/claude-opus-4-5-20251101")
dspy.configure(lm=lm)

# Define signatures for each stage
class ExtractThemes(dspy.Signature):
    """Extract themes from customer feedback"""
    feedback: str = dspy.InputField(desc="Raw customer feedback")
    themes: str = dspy.OutputField(desc="Key themes found in feedback, as bullet points")

class PrioritizeThemes(dspy.Signature):
    """Rank themes by business impact"""
    themes: str = dspy.InputField(desc="Themes extracted from feedback")
    priority_ranking: str = dspy.OutputField(
        desc="Ranked themes with reasoning, highest impact first"
    )

class GenerateActions(dspy.Signature):
    """Create actionable recommendations"""
    priority_ranking: str = dspy.InputField(desc="Prioritized themes")
    actions: str = dspy.OutputField(desc="Specific, actionable recommendations")

# Compose them into a pipeline
class FeedbackAnalyzer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.extract = ChainOfThought(ExtractThemes)
        self.prioritize = ChainOfThought(PrioritizeThemes)
        self.recommend = ChainOfThought(GenerateActions)

    def forward(self, feedback):
        themes = self.extract(feedback=feedback).themes
        ranking = self.prioritize(themes=themes).priority_ranking
        actions = self.recommend(priority_ranking=ranking).actions
        return dspy.Prediction(
            themes=themes,
            ranking=ranking,
            actions=actions
        )

# Now compile/optimize it
analyzer = FeedbackAnalyzer()

# Define your metric
def metric_fn(example, pred, trace=None):
    # You decide what "good" looks like
    return len(pred.actions.split('\n')) >= 3  # At least 3 actions

# Optimize automatically
from dspy.teleprompt import BootstrapFewShot
optimizer = BootstrapFewShot(metric=metric_fn)
# training_examples: a small list of dspy.Example objects,
# e.g. dspy.Example(feedback="...").with_inputs("feedback")
optimized_analyzer = optimizer.compile(analyzer, trainset=training_examples)

# Use it
result = optimized_analyzer(feedback="Users hate that features are buried in menus...")
print(result)

What just happened?

  1. You defined structure, not prompts. The signatures tell DSPy what the inputs and outputs are—clean, declarative, self-documenting.

  2. You composed multiple steps. Each module does one thing well. The pipeline is readable. Testable. You could swap out modules independently.

  3. The system optimized itself. You gave it a metric (what "good" means) and training examples, and the optimizer tuned the internal prompts and few-shot examples automatically.

  4. It stays portable. Need to switch from Claude to another model? Update one line. The module structure doesn't change. The signatures don't change. (See the sketch below.)
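
To make that last point concrete, here's a minimal sketch of the swap. It assumes DSPy's dspy.LM wrapper, and the model string is an illustrative placeholder:

# Hypothetical model swap: only the configured LM changes.
# The FeedbackAnalyzer module and its signatures stay exactly as written.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # illustrative model name

result = optimized_analyzer(feedback="Users hate that features are buried in menus...")
print(result.actions)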

DSPy vs. LangChain

LangChain is a mature framework that does something superficially similar. If I were building the same pipeline in LangChain, it would look like this:

from langchain_core.prompts import PromptTemplate
from langchain_anthropic import ChatAnthropic

# Initialize the model
model = ChatAnthropic(model="claude-opus-4-5-20251101")

# Define individual prompts
extract_prompt = PromptTemplate(
    input_variables=["feedback"],
    template="""Analyze this customer feedback and extract key themes.

Feedback: {feedback}

Extract themes as bullet points."""
)

prioritize_prompt = PromptTemplate(
    input_variables=["themes"],
    template="""Given these themes, rank them by business impact.

Themes: {themes}

Provide ranked themes with reasoning."""
)

actions_prompt = PromptTemplate(
    input_variables=["priority_ranking"],
    template="""Based on these prioritized themes, create actionable recommendations.

Themes: {priority_ranking}

Generate specific, actionable recommendations."""
)

# Chain them together manually
def analyze_feedback(feedback):
    # Step 1: Extract themes
    themes_result = model.invoke(extract_prompt.format(feedback=feedback))
    themes = themes_result.content

    # Step 2: Prioritize
    priority_result = model.invoke(prioritize_prompt.format(themes=themes))
    priority = priority_result.content

    # Step 3: Generate actions
    actions_result = model.invoke(actions_prompt.format(priority_ranking=priority))

    return {
        "themes": themes,
        "priority": priority,
        "actions": actions_result.content
    }

# Use it
result = analyze_feedback("Users hate that features are buried in menus...")
print(result)

LangChain's strength: It's excellent for orchestrating complex workflows with tools, agents, and memory. It's mature, battle-tested, and solves real production problems. If you need to build a system with RAG, tool calling, and agent loops, LangChain is the pragmatic choice.

DSPy's strength: It solves the prompt engineering problem at its core. You don't optimize manually; the system does it for you. This is powerful if you care about maintainability, portability, and not wasting time tweaking strings.

The key difference: LangChain assumes you'll write good prompts and compose them well. DSPy assumes you'll define the problem structure and let the system optimize. One is about better orchestration. The other is about taking manual prompt tuning out of the loop entirely.

In the LangChain version, when the model's performance drifts, I'm back to tweaking template strings. In the DSPy version, I define a metric and let the optimizer run. That's the fundamental philosophical difference.
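
In code, "let the optimizer run" just means re-running the compile step from earlier against the new model instead of editing templates. A rough sketch, reusing metric_fn and training_examples from above (the model name is hypothetical):

# Performance drifted, or a new model shipped? Point DSPy at it and re-optimize.
dspy.configure(lm=dspy.LM("anthropic/claude-sonnet-4-5"))  # hypothetical model name

reoptimized_analyzer = BootstrapFewShot(metric=metric_fn).compile(
    FeedbackAnalyzer(), trainset=training_examples
)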

I've been building software for a long time. I've seen frameworks and tools come and go. What I appreciate is when something respects the principles of good engineering: modularity, composability, testability, maintainability.

DSPy does that. It treats AI systems like actual systems, not like configuration files.

Is DSPy a replacement for LangChain? In my opinion, it is not. They solve different problems. DSPy represents a genuinely different way of thinking about LLM applications. And after a year of tweaking prompt strings, that felt pretty nice.