yt-search: Semantic Search Over YouTube Transcripts • Daniel Herman

The Problem

YouTube videos contain enormous amounts of knowledge, but that knowledge is locked behind a linear timeline. You can’t search inside a 3-hour podcast for the one segment where the guest explains their approach to distributed systems. YouTube’s built-in search is title-and-description only - it doesn’t search what was actually said.

For AI agents, this is even worse. An agent can’t watch a video. It needs structured, queryable access to video content to do anything useful with it.

The Solution

yt-search is a CLI tool that downloads YouTube subtitles, builds a local search index, and lets you semantically query video transcripts. It’s designed to be called by AI agents (like Claude Code) that handle reasoning - the tool handles retrieval only.

# Build an index from one or more videos
yt-search download "https://youtube.com/watch?v=..."
# -> a3f2c1b4

# Query with natural language
yt-search query a3f2c1b4 "what does he say about attention mechanisms?"

Results come back as JSON with the matching text, video title, timestamp, and relevance score - so the agent (or you) can jump straight to the right moment.

How It Works

The retrieval pipeline combines two complementary search strategies, merges them, and reranks for precision:

Hybrid retrieval pipeline: dense + sparse search fused via RRF, then reranked by a cross-encoder.

Dense retrieval - video chunks are embedded with nomic-embed-text-v1.5 into a FAISS index for semantic similarity search
Sparse retrieval - BM25 handles exact keyword matches that embeddings sometimes miss
Fusion - results from both are merged using Reciprocal Rank Fusion (RRF, K=60)
Reranking - top-20 candidates are rescored by a cross-encoder (ms-marco-MiniLM-L-6-v2) for final precision

This hybrid approach catches both semantic meaning (“how does backpropagation work”) and specific terms (“Adam optimizer learning rate 3e-4”) that pure embedding search would miss.
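To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion with K=60. It is not taken from the yt-search source: the function name `rrf_fuse` and the chunk IDs are made up for illustration, and it assumes each retriever returns a list of chunk IDs ordered best-first.

```python
from collections import defaultdict

RRF_K = 60  # the K=60 constant mentioned in the pipeline description


def rrf_fuse(rankings, k=RRF_K):
    """Fuse several best-first lists of chunk IDs into one best-first list.

    Each chunk earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top hit.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk IDs returned by the two retrievers.
dense = ["c7", "c2", "c9"]   # e.g. from the FAISS index
sparse = ["c2", "c4", "c7"]  # e.g. from BM25
fused = rrf_fuse([dense, sparse])
print(fused)
```

Chunks that both retrievers agree on (`c2`, `c7`) rise to the top, while chunks found by only one retriever stay in the candidate pool for the reranker.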

Sessions

Sessions are content-addressed by video IDs - the same set of URLs always reuses the same index, even across restarts. Indexes are cached locally with a 24-hour TTL, so no work is duplicated.
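One way to get this behavior is to hash the sorted set of video IDs. The sketch below is an assumption about how such an ID could be derived, not the tool's actual scheme: the choice of SHA-256, the 8-character truncation (picked to match the look of IDs like `a3f2c1b4`), and the name `session_id` are all hypothetical.

```python
import hashlib


def session_id(video_ids, length=8):
    """Derive a stable, order-insensitive ID from a collection of video IDs.

    Assumption: SHA-256 over the sorted, de-duplicated IDs, truncated to
    `length` hex characters. The real tool may use a different scheme.
    """
    canonical = "\n".join(sorted(set(video_ids)))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return digest[:length]


# The same set of videos maps to the same session, whatever the order.
first = session_id(["zjkBMFhNj_g", "abc123"])
second = session_id(["abc123", "zjkBMFhNj_g"])
print(first == second)
```

Because the ID depends only on the content, a restart or a repeated `download` call can look up the existing index instead of rebuilding it.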

Rate Limiting

A circuit breaker tracks consecutive download failures. After 3 consecutive 429 errors from YouTube, all downloads pause for 120 seconds. Individual videos retry with exponential backoff. This keeps the tool usable without hammering YouTube’s servers.
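The logic can be sketched in a few lines. This is an illustrative version under stated assumptions rather than the code in the repository: the class and function names are invented, any non-429 outcome is assumed to reset the counter, and the backoff base and cap are placeholder values. Only the threshold of 3 and the 120-second pause come from the description above.

```python
import time


class CircuitBreaker:
    """Pause all downloads after too many consecutive HTTP 429 responses."""

    def __init__(self, threshold=3, cooldown=120.0, clock=time.monotonic):
        self.threshold = threshold    # consecutive 429s before tripping
        self.cooldown = cooldown      # seconds to pause once tripped
        self.clock = clock            # injectable so it can be tested
        self.consecutive_429 = 0
        self.open_until = 0.0

    def allow(self):
        """Return True if downloads may proceed right now."""
        return self.clock() >= self.open_until

    def record(self, status_code):
        """Update the breaker with the outcome of one download attempt."""
        if status_code == 429:
            self.consecutive_429 += 1
            if self.consecutive_429 >= self.threshold:
                self.open_until = self.clock() + self.cooldown
                self.consecutive_429 = 0
        else:
            # Assumption: anything other than a 429 resets the streak.
            self.consecutive_429 = 0


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff for a single video: base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))
```

A download loop would check `allow()` before each request, call `record()` with the response status, and sleep for `backoff_delay(attempt)` between retries of the same video.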

Why It’s Useful

The main use case is AI-assisted research. Once installed as a Claude Code skill, you can ask Claude to research YouTube content and it will:

Download and index the videos
Query iteratively with different phrasings
Synthesize an answer from the returned chunks with timestamps as citations

This turns hours of video into searchable, citable knowledge in seconds.

How Claude Code uses yt-search: ask a question, the agent iteratively queries the index with different phrasings, then synthesizes an answer with timestamped citations.

Real Example

Here’s what a real session looks like. I indexed Karpathy’s 1-hour “Intro to Large Language Models” talk and ran three queries against it.

Step 1: Download and index

yt-search download "https://www.youtube.com/watch?v=zjkBMFhNj_g"
# -> 156b4111

The transcript is downloaded via yt-dlp, chunked into ~400-word segments, and indexed into both FAISS and BM25 stores. The whole step takes about 2 seconds.

Step 2: Query - “What does Karpathy say about scaling laws?”

yt-search query 156b4111 "What does Karpathy say about scaling laws?"

Top result - timestamp [25:33], score -2.41:

“…the first very important thing to understand about the large language model space are what we call scaling laws. It turns out that the performance of these large language models in terms of the accuracy of the next word prediction task is a remarkably smooth, well-behaved and predictable function of only two variables. You need to know N, the number of parameters in the network, and D, the amount of text that you’re going to train on. Given only these two numbers we can predict with remarkable confidence what accuracy you’re going to achieve on your next word prediction task. And what’s remarkable about this is that these trends do not seem to show signs of topping out…”

This directly answers the question. The timestamp lets you jump straight to 25:33 in the video to hear it in context.
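Turning a timestamp into a clickable link is simple arithmetic. The helper below is a small sketch, with function names invented for illustration; it assumes timestamps arrive as `MM:SS` or `HH:MM:SS` strings like the ones shown in this post, and it uses YouTube's `t=<seconds>s` URL parameter.

```python
def timestamp_to_seconds(stamp):
    """Convert 'MM:SS' or 'HH:MM:SS' into a number of seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def jump_url(video_id, stamp):
    """Build a YouTube link that starts playback at the given timestamp."""
    return (
        f"https://www.youtube.com/watch?v={video_id}"
        f"&t={timestamp_to_seconds(stamp)}s"
    )


print(jump_url("zjkBMFhNj_g", "25:33"))
# -> https://www.youtube.com/watch?v=zjkBMFhNj_g&t=1533s
```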

Step 3: Query - “LLM security attacks prompt injection jailbreak”

yt-search query 156b4111 "LLM security attacks prompt injection jailbreak"

Top result - timestamp [53:46], score -2.09:

“…suppose someone shares a Google Doc with you and you ask Bard to help you somehow with this Google Doc…well actually this Google Doc contains a prompt injection attack and Bard is hijacked with new instructions, a new prompt, and it does the following: it tries to get all the personal data or information that it has access to about you and it tries to exfiltrate it…”

Different topic, same video - the retrieval pipeline finds the exact segment where Karpathy discusses prompt injection attacks on LLMs.

Step 4: Query - “compute optimal training”

yt-search query 156b4111 "compute optimal training"

Top result - timestamp [04:08], score -10.72:

“…you basically take a chunk of the internet that is roughly 10 terabytes of text…then you procure a GPU cluster…these are very specialized…”

This is a harder query - Karpathy doesn’t explicitly use the phrase “compute optimal” in this talk, so the top score is much lower (-10.72, compared with -2.41 and -2.09 for the previous two queries). The hybrid retrieval still finds the most relevant segment about training compute, but the cross-encoder correctly scores it lower because it’s a weaker semantic match. This is useful signal: it tells the agent that this video doesn’t cover that specific subtopic in depth.
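An agent can act on that signal with a simple cutoff. The snippet below is only a sketch: the threshold of -5.0 is a hypothetical value chosen to separate the three scores in this post, not something yt-search defines, and a real cutoff would need tuning because raw cross-encoder scores are not calibrated probabilities.

```python
# Hypothetical cutoff; pick it by inspecting scores on your own queries.
WEAK_MATCH_THRESHOLD = -5.0


def is_weak_match(top_score, threshold=WEAK_MATCH_THRESHOLD):
    """Flag a query whose best reranker score falls below the cutoff."""
    return top_score < threshold


# Top scores from the three queries in this post.
top_scores = {
    "What does Karpathy say about scaling laws?": -2.41,
    "LLM security attacks prompt injection jailbreak": -2.09,
    "compute optimal training": -10.72,
}

weak = [query for query, score in top_scores.items() if is_weak_match(score)]
print(weak)
```

Queries that land in `weak` are candidates for rephrasing, or for reporting that the video does not cover the topic.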

What an agent does with this

An AI agent like Claude Code would run all three queries, look at the scores, and synthesize:

Karpathy explains that scaling laws show LLM performance is a “remarkably smooth, well-behaved and predictable function” of just two variables: N (parameters) and D (training data). He emphasizes that these trends “do not seem to show signs of topping out,” which is what’s driving the current GPU compute arms race [25:33]. He also covers LLM security, demonstrating how prompt injection attacks can hijack models through hidden text in images or shared documents [53:46]. The talk doesn’t cover Chinchilla-style compute-optimal training directly.

That’s a 1-hour video turned into a cited summary in seconds.

Technical Details

Component	Model / index	Purpose
Embedding	`nomic-ai/nomic-embed-text-v1.5` (522 MB)	Semantic similarity
Reranking	`cross-encoder/ms-marco-MiniLM-L-6-v2` (6 MB)	Precision reranking
Indexing	FAISS flat inner-product + BM25	Hybrid retrieval

Built with Python. Subtitles downloaded via yt-dlp, chunked into ~400-word segments at natural boundaries, each tagged with video metadata and timestamp.
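As a rough illustration of the chunking step, the sketch below groups subtitle cues into segments of about 400 words and tags each with its start time. It simplifies the real behavior: "natural boundaries" is approximated by only breaking between cues, the input shape (a list of `(start_seconds, text)` pairs) and the output dictionary keys are assumptions, and video metadata is left out.

```python
def chunk_cues(cues, target_words=400):
    """Group (start_seconds, text) cues into chunks of roughly target_words.

    A chunk is closed at the first cue boundary where it reaches the
    target size, so chunks can run slightly over target_words.
    """
    chunks = []
    words, start = [], None
    for cue_start, text in cues:
        if start is None:
            start = cue_start  # the chunk's timestamp is its first cue's start
        words.extend(text.split())
        if len(words) >= target_words:
            chunks.append({"start": start, "text": " ".join(words)})
            words, start = [], None
    if words:
        # Keep whatever is left over as a final, shorter chunk.
        chunks.append({"start": start, "text": " ".join(words)})
    return chunks


# Nine synthetic cues of 100 words each, five seconds apart.
cues = [(i * 5.0, "word " * 100) for i in range(9)]
chunks = chunk_cues(cues)
print([(c["start"], len(c["text"].split())) for c in chunks])
```

Keeping the start time of the first cue in each chunk is what lets a query result point back to a specific moment in the video.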

Source code: github.com/detrin/yt-search