VSL & Ad Creative Dataset — License On Demand

8,226+ validated winning VSLs & ad creatives

72+ direct-response niches covered

12.5 TB total corpus, refreshed nightly

What the dataset is

The VSL & Ad Creative Dataset is the same structured corpus that powers Daily Intel Service's AI Copy Agent — exposed as data you can license directly. It is not a scrape of every ad in a public library. It is a curated set of 8,226+ VSLs and ads across 72+ direct-response niches that were first validated as winners, then run through an extraction pipeline that turns raw video and ad transcripts into machine-readable records.

At about 12.5 TB and refreshed nightly, the dataset is designed to be queried, joined, and modeled — not just read. Every asset is normalized into the same schema, so a record from a financial VSL and a record from a supplement ad expose the same fields. That uniformity is what makes the corpus usable for training, analytics, and internal tooling rather than only for manual swipe browsing.

What a single record contains

Each asset resolves to a transcript plus a source_kind that marks it as a VSL or an ad, the niche it belongs to, and a layered set of derived fields. The most important layer is the extracted units: every transcript is decomposed into twelve unit kinds — hooks, pains, tactics, authorities, promises, social proof, urgency, CTAs, vocabulary, villains, avatars, and mechanisms — so you can ask structural questions like 'which mechanisms recur in this niche' without parsing prose yourself.

Above the units sits the chunk layer. Each transcript is split into windowed chunks, and each chunk is tagged with a section label. Long-form VSLs use opening, big idea, problem, agitation, mechanism, social proof, offer, urgency/scarcity, and close; ads use hook, agitation, promise, and CTA. That labeling lets you study a corpus by section: pull every opening across a niche, or every offer stack, and compare structure directly. Finally, each unit carries cluster and pattern membership, so records are pre-grouped by the recurring patterns they belong to rather than left as isolated rows.

Transcript: Full text of every validated VSL and ad

source_kind: VSL or ad, with niche assignment

Extracted units: 12 kinds, including hooks, pains, promises, and mechanisms

Windowed chunks: Section-labeled (opening, offer, close, ad hook…)

Cluster membership: Pre-grouped by recurring pattern

Per-niche aggregates: Top hooks, pains, tactics, vocabulary per niche
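To make the record shape concrete, here is a minimal Python sketch of how a consumer might model it. The class names, field names other than source_kind, the cluster ids, and the sample rows are all illustrative assumptions; the actual export schema is agreed during scoping and may differ.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical field names and made-up sample rows, for illustration only.

@dataclass
class Unit:
    kind: str        # one of the twelve unit kinds, e.g. "hook" or "mechanism"
    text: str
    cluster_id: str  # recurring pattern this unit was grouped into

@dataclass
class Chunk:
    section: str     # section label, e.g. "opening", "offer", "close", "hook"
    text: str

@dataclass
class Record:
    asset_id: str
    source_kind: str  # "vsl" or "ad"
    niche: str
    transcript: str
    units: list[Unit] = field(default_factory=list)
    chunks: list[Chunk] = field(default_factory=list)

def recurring_clusters(records, niche, kind):
    """Count how many assets in a niche contain each cluster of one unit kind."""
    counts = Counter()
    for rec in records:
        if rec.niche != niche:
            continue
        # Count a cluster once per asset so one long VSL cannot dominate.
        counts.update({u.cluster_id for u in rec.units if u.kind == kind})
    return counts

def chunks_by_section(records, niche, section):
    """Collect the text of every chunk in a niche that carries a section label."""
    return [c.text for r in records if r.niche == niche
            for c in r.chunks if c.section == section]

records = [
    Record("a1", "vsl", "supplements", "(transcript text)",
           units=[Unit("mechanism", "gut lining repair", "m-gut"),
                  Unit("hook", "the advice you were given is backwards", "h-contrarian")],
           chunks=[Chunk("opening", "What if bloating was never about food?")]),
    Record("a2", "ad", "supplements", "(transcript text)",
           units=[Unit("mechanism", "restores the gut barrier", "m-gut")],
           chunks=[Chunk("hook", "Bloated by 3pm?")]),
    Record("a3", "vsl", "finance", "(transcript text)",
           units=[Unit("mechanism", "dividend capture loop", "m-div")],
           chunks=[Chunk("opening", "The bank will not tell you this.")]),
]

print(recurring_clusters(records, "supplements", "mechanism").most_common())
print(chunks_by_section(records, "supplements", "opening"))
```

Because every asset shares the same shape, the two helper functions work unchanged for a financial VSL or a supplement ad; only the niche and kind arguments change.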

How the dataset is built

The pipeline is two-sided: human validation in front, automated extraction behind. Assets are first selected and validated as genuine winners before anything is processed — the corpus is deliberately a set of proven creatives, not an undifferentiated dump. Validated transcripts then pass through a contextual-retrieval step that situates each chunk in the surrounding document before it is indexed, which keeps short excerpts interpretable on their own.
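The idea of situating a chunk before indexing can be illustrated with a deliberately simplified stand-in. The sketch below splits a transcript into overlapping word windows and prepends a short header naming the source kind, niche, and position. The function names, window sizes, and header format are assumptions made for illustration; the production step is not documented here and likely derives richer context than a fixed header.

```python
def window_chunks(words, size, overlap):
    """Split a list of words into overlapping windows of at most `size` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the transcript
    return chunks

def contextualize(transcript, niche, source_kind, size=8, overlap=2):
    """Return chunk strings, each prefixed with a header describing its source."""
    windows = window_chunks(transcript.split(), size, overlap)
    out = []
    for i, window in enumerate(windows):
        # Hypothetical header format: enough for an excerpt to stand on its own.
        header = f"[{source_kind} | niche: {niche} | chunk {i + 1} of {len(windows)}]"
        out.append(header + " " + " ".join(window))
    return out

sample = " ".join(f"w{i}" for i in range(14))
for line in contextualize(sample, "supplements", "vsl"):
    print(line)
```

The overlap keeps a sentence that straddles a window boundary visible in both neighbouring chunks, and the header is what lets a short excerpt be interpreted without opening the full transcript.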

Extraction of the twelve unit kinds runs on a compact, cost-efficient model (Anthropic's Haiku tier) so the pipeline scales across the full corpus. Extracted units and chunks are embedded into vector space, then grouped by an online clustering process that updates as new assets land each night. Per-niche aggregates — the top hooks, pains, tactics, authorities, villains, and vocabulary for each vertical — are recomputed from those clusters, giving you both row-level records and pre-rolled niche summaries.
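The specific clustering algorithm is not named above, so the following is only a sketch of one common online approach: leader-style clustering, where each new embedding joins the nearest existing centroid if it is similar enough and otherwise opens a new cluster. The class name, the 0.9 threshold, and the two-dimensional toy vectors are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class OnlineClusterer:
    """Assign each vector to its nearest centroid, or start a new cluster."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.centroids = []  # running mean vector per cluster
        self.sizes = []      # number of members per cluster

    def add(self, vec):
        best, best_sim = None, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < self.threshold:
            self.centroids.append(list(vec))
            self.sizes.append(1)
            return len(self.centroids) - 1
        # Update the running mean so the centroid tracks its members.
        n = self.sizes[best]
        self.centroids[best] = [(c * n + v) / (n + 1)
                                for c, v in zip(self.centroids[best], vec)]
        self.sizes[best] = n + 1
        return best

# Toy embeddings arriving in two nightly batches (made-up values).
night_one = [("supplements", [1.0, 0.0]), ("supplements", [0.9, 0.1]), ("finance", [0.0, 1.0])]
night_two = [("supplements", [0.95, 0.05]), ("finance", [0.1, 0.9])]

clusterer = OnlineClusterer(threshold=0.9)
per_niche = defaultdict(Counter)  # niche -> cluster id -> member count
for batch in (night_one, night_two):
    for niche, vec in batch:
        per_niche[niche][clusterer.add(vec)] += 1

print({niche: dict(counts) for niche, counts in per_niche.items()})
```

The point of the online form is that the second batch is folded into the existing clusters without reprocessing the first, and the per-niche counts can be recomputed from cluster assignments alone.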

What teams build with it

Four use cases come up most. The first is training and fine-tuning: the labeled units and section-tagged chunks make the corpus a ready-made supervised signal for teams building their own copy or classification models, without having to assemble and annotate creatives from scratch. The second is internal swipe and research tooling — wiring the structured records into a private search interface so your writers query patterns the way the AI Copy Agent does, inside your own product.

The third is competitive and market research: because every record is tagged by niche and grouped into clusters, analysts can quantify how a market frames pain, which mechanisms dominate, and how those patterns shift over the nightly refresh. The fourth is agency creative operations — standing up a shared, structured reference layer so a creative team works from the same validated patterns instead of scattered personal folders. In every case the dataset is research intelligence, not a copy-and-paste source: it is meant to inform analysis and model training, not to reproduce another advertiser's creative verbatim.

License the dataset on demand

Tell us your niches and use case — we scope a dataset export or API feed for your team.

Request dataset access

Coverage and freshness

Coverage today spans 8,226+ validated VSLs and ads across 72+ niches, totaling roughly 12.5 TB. The corpus is intentionally broad across the direct-response spectrum — aggressive financial and opportunity angles sit alongside compliant supplement and health framing — so a license can be scoped to exactly the verticals your team works in rather than forcing you to take the whole set.

Freshness is a first-class property. The pipeline runs nightly, so new validated assets, their extractions, and the clusters they belong to are folded in continuously. That means a licensed feed reflects what has been validated recently, and per-niche aggregates move as the market moves — useful for any analysis that needs to distinguish an enduring pattern from a passing one.
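One way an analyst might separate enduring patterns from passing ones is to measure how often a cluster shows up across successive snapshots of a feed. The sketch below assumes each snapshot is simply a list of cluster ids observed for one niche on one night; the snapshot shape, the cluster ids, and the 0.75 cut-off are hypothetical choices, not properties of the dataset.

```python
from collections import Counter

def pattern_persistence(snapshots):
    """Map each cluster id to the fraction of snapshots in which it appears."""
    if not snapshots:
        return {}
    seen = Counter()
    for snap in snapshots:
        seen.update(set(snap))  # count a cluster at most once per snapshot
    return {cid: n / len(snapshots) for cid, n in seen.items()}

def split_patterns(snapshots, min_share=0.75):
    """Return (enduring, passing) cluster ids based on a persistence cut-off."""
    persistence = pattern_persistence(snapshots)
    enduring = sorted(c for c, p in persistence.items() if p >= min_share)
    passing = sorted(c for c, p in persistence.items() if p < min_share)
    return enduring, passing

# Made-up cluster ids for four consecutive nights in one niche.
nights = [
    ["h-contrarian", "h-secret"],
    ["h-contrarian", "h-news"],
    ["h-contrarian", "h-secret"],
    ["h-contrarian", "h-secret"],
]

enduring, passing = split_patterns(nights)
print("enduring:", enduring)
print("passing:", passing)
```

A real analysis would probably weight by asset count and use a longer window, but the same comparison across refreshes is what the nightly cadence makes possible.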

How licensing and delivery work

Licensing is on demand and scoped to your niches. Rather than a fixed self-serve download, an engagement starts with the niches and the use case you bring (training corpus, internal tool, research, or agency reference), and we scope an export or feed against the corpus of 8,226+ assets to match it. That keeps you from paying for verticals you will never query and lets us right-size the volume to your model or product.

Delivery can take the shape of a one-time structured export of records (transcripts, units, chunks, and cluster membership) or an ongoing feed that tracks the nightly refresh, depending on whether you are training once or running continuous analysis. Field selection, niche scope, and refresh cadence are all part of the scoping conversation. The same data also remains queryable through the member-facing AI Copy Agent for teams that want the retrieval layer rather than the raw corpus.
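Niche scope and field selection amount to a filter and a projection over records. The sketch below shows that idea with records as plain dictionaries written out as JSON Lines. The delivery format, the function name, and the key names are assumptions for illustration; the actual format is settled during scoping.

```python
import json

def scope_export(records, niches, fields):
    """Yield one JSON line per record in the chosen niches, keeping only `fields`."""
    wanted = set(niches)
    for rec in records:
        if rec.get("niche") in wanted:
            # Keep only requested keys that the record actually has.
            row = {k: rec[k] for k in fields if k in rec}
            yield json.dumps(row, sort_keys=True)

# Made-up records; key names are hypothetical.
corpus = [
    {"asset_id": "a1", "niche": "supplements", "source_kind": "vsl",
     "transcript": "(transcript text)", "units": [{"kind": "hook", "cluster_id": "h-1"}]},
    {"asset_id": "a2", "niche": "finance", "source_kind": "ad",
     "transcript": "(transcript text)", "units": []},
]

lines = list(scope_export(corpus, niches=["supplements"],
                          fields=["asset_id", "niche", "units"]))
for line in lines:
    print(line)
```

A one-time export would run this once over the full scoped set, while an ongoing feed could apply the same filter to each night's new or changed records.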

The bottom line

The VSL & Ad Creative Dataset turns 8,226+ validated winning VSLs and ads into a uniform, queryable corpus — transcripts, twelve kinds of extracted units, section-labeled chunks, and cluster membership across 72+ niches, refreshed nightly. Licensed on demand and scoped to your verticals, it is built to train models, power internal tooling, and ground market research on what has actually been validated as winning.

Frequently asked questions

How big is the dataset?

It covers 8,226+ validated VSLs and ads across 72+ niches, totaling about 12.5 TB, and is refreshed nightly. A license is scoped to the niches you need, so you are not required to take the entire corpus.

What does each record include?

A full transcript, a source kind (VSL or ad), the niche, twelve kinds of extracted units (hooks, pains, tactics, authorities, promises, social proof, urgency, CTAs, vocabulary, villains, avatars, mechanisms), section-labeled windowed chunks, and cluster/pattern membership.

How is the data produced?

Assets are first validated as winners by hand, then run through an AI pipeline: contextual retrieval, unit extraction on a Haiku-tier model, embeddings, online clustering, and per-niche aggregates. The pipeline reruns nightly to fold in new assets.

What can I use the dataset for?

Common uses are training or fine-tuning copy models, building internal swipe and search tools, competitive and market research, and agency creative operations. It is research intelligence meant to inform analysis and models, not to reproduce another advertiser's copy verbatim.

How do I license it?

Licensing is on demand. Tell us your niches and use case through the contact form and we scope a structured export or an ongoing feed against the corpus, with field selection, niche scope, and refresh cadence agreed during scoping.

Can I get the data without raw export?

Yes. The same corpus is queryable through the member-facing AI Copy Agent's retrieval tools, so teams that want the retrieval layer rather than the raw records can work through the agent instead of a bulk export.