- How likely is AI-generated content to be cited in AI Overviews? How does the citation rate for AI-generated content compare with that for human-written content (is the relationship a positive or negative correlation)?
- What proportion of sources cited in Google AI Overviews are AI-generated?
- How is the AI/human ratio distributed across different query topics?
- What queries do not result in an AI summary? Which queries result in "An AI Overview is not available for this search"?
- Does this impact the quality of overviews?
- What are the implications for SEO?
- Week 1 (June 9+): Plan the study and outline the data collection pipeline
- Week 2 (June 16+): Refine the core methodology to measure P(cited | AI) vs P(cited | human); run a citation coverage analysis across N = 10 to 80 and select N = 40 as optimal.
- Week 3 (June 23+): Scale up sample size (v2) and classify documents using Originality.ai. Compute preliminary citation probabilities for AI vs human content.
- Week 4 (June 30+): Finalize main results (P(cited | AI) vs P(cited | human)), generate supporting tables and plots, and begin drafting key insights.
- Weeks 5–7 (July 7+): Explore secondary questions (e.g., domain-level trends, content depth), refine plots and findings, and begin outlining the blog post.
- Week 8 (July 28+): Draft the blog post and prepare final graphics and visualizations for publication.
- Week 9 (August 4+): Optional: package technical findings into a short paper or internal report.
- Download the MS MARCO queries dataset (9.2M real Bing queries) and save it as `/dataset/ms-marco-web-search-queries.tsv`.
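A small sketch of loading the saved file, assuming a headerless two-column TSV (query id, query text); the column layout is an assumption, so adjust the names and separator to match the actual export.

```python
# Sketch: load the MS MARCO queries TSV saved above. Column names are assumed.
import pandas as pd

queries = pd.read_csv(
    "dataset/ms-marco-web-search-queries.tsv",
    sep="\t",
    header=None,
    names=["query_id", "query"],  # assumed column layout
)
print(f"{len(queries):,} queries loaded")
```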
- `/datasets` - all source datasets.
- `/samples` - contains sampled queries from the MS MARCO dataset.
  - e.g. a folder named `v1_50` means the `v1` filter was used (filters out queries that are unlikely to trigger AI Overviews) and 50 queries were randomly sampled.
  - Files prefixed with `queries_` contain queries and their unique IDs.
  - Query files postfixed with `_labeled` contain an additional `triggered_ai_overview` column (see the sketch after this list):
    - `y`: query triggers an AI Overview
    - `n`: no AI Overview
    - `b`: an AI Overview was attempted but blocked by Google's policies ("An AI Overview is not available for this search" is displayed), e.g. the query `why liberals hate america`.
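A minimal sketch of inspecting a labeled sample file, assuming a tab-separated file containing the `triggered_ai_overview` column documented above; the file path and any other column names are illustrative, not the exact layout.

```python
# Sketch: tally AI Overview trigger labels (y / n / b) in a labeled sample.
# The path is an assumption; only the triggered_ai_overview column is
# documented in the repo structure above.
import pandas as pd

labeled = pd.read_csv("samples/v1_50/queries_v1_50_labeled.tsv", sep="\t")

shares = labeled["triggered_ai_overview"].value_counts(normalize=True)
print(shares.rename({"y": "AI Overview", "n": "no AI Overview", "b": "blocked"}))
```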
- GEO: Generative Engine Optimization - a large-scale benchmark of diverse user queries across multiple domains, along with relevant web sources to answer these queries. The authors report that GEO methods can boost visibility by up to 40% in generative engine responses.
- How Deep Do Large Language Models Internalize Scientific Literature and Citation Practices? (arXiv:2504.02767) - analyzes ~275,000 GPT‑4 citations across 10,000 simulated papers, revealing a systematic preference for highly cited, recent sources: a "rich get richer" effect in AI-generated references.
- From Content Creation to Citation Inflation: A GenAI Case Study - examines AI-generated papers that cite each other, artificially boosting each other's credibility.
Model collapse when training on AI data:
- The Curse of Recursion: Training on Generated Data Makes Models Forget (Shumailov et al., 2023) - models lose diversity and information over time
- AI models collapse when trained on recursively generated data (Nature, 2024) - model outputs degrade in quality and accuracy
| Dataset | # Queries | Recency | Source Type | Notes |
|---|---|---|---|---|
| MS MARCO Web Search | ~10 million | ~2024 | Human (Bing search logs) | Real-world queries; main dataset |
| ORCAS | ~10 million | ~2020 | Human (click logs) | Includes query-document pairs with user click signals |
| Natural Questions | ~320,000 | ~2019 | Human (Google QA queries) | QA-focused dataset with gold answers |
We selected MS MARCO Web Search as our primary dataset because:
- Large, diverse set of real user queries from Bing
- Recency (2024), reflecting modern search behavior
- Representative of average user search, covers a wide range of query types
- Well documented and formatted
- **Sample Queries**: Generate or select a large set of queries predicted to trigger AI Overviews using the WTAO filter.
- **Run Queries & Collect Responses**: For each query, retrieve:
  - the AI Overview response with all cited URLs
  - the top N organic search result URLs (configurable N, e.g., 10 or 20)
- **Combine & Deduplicate URLs**: Merge URLs from both organic results and AI Overview citations, normalize them to avoid duplicates (e.g., remove query parameters, use consistent casing), and deduplicate to create a master pool of unique URLs (see the normalization sketch after this list).
- **Label URLs with Citation & Organic Counts**: For each URL, track:
  - `cited_count`: number of times the URL is cited by AI Overviews across all queries
  - `in_organic_results_count`: number of times the URL appears in organic results
- **Classify URLs as AI-generated or Human-written**: Use the Originality.ai Batch Scan API to classify each URL's content, and store classification results including confidence scores and labels (see the classification sketch after this list).
- **Calculate Citation Probabilities and Analyze**: Compute conditional probabilities (see the probability sketch after this list):
  - P(cited | AI-generated) = (# cited AI URLs) / (# total AI URLs)
  - P(cited | human-written) = (# cited human URLs) / (# total human URLs)

  Then analyze citation frequency distributions, overlap ratios, and trends over time or by query category.
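The sketch below shows one way the URL normalization and deduplication step could look, using only the standard library. The specific rules (lowercasing the host, dropping query strings and fragments, stripping trailing slashes) are assumptions about this pipeline, not a fixed specification.

```python
# Sketch: normalize URLs so the same page collected from organic results and
# AI Overview citations maps to a single key. The rules here (lowercase host,
# drop query string/fragment, strip trailing slash) are assumptions.
from urllib.parse import urlparse, urlunparse

def normalize_url(url: str) -> str:
    parts = urlparse(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunparse((
        parts.scheme.lower() or "https",
        parts.netloc.lower(),
        path,
        "",  # params
        "",  # query string dropped
        "",  # fragment dropped
    ))

def build_url_pool(organic_urls, cited_urls):
    """Merge both URL lists into a deduplicated master pool."""
    pool = {normalize_url(u) for u in organic_urls}
    pool |= {normalize_url(u) for u in cited_urls}
    return sorted(pool)
```

Dropping query parameters wholesale is a simplification; some sites use them to identify distinct pages, so this rule may need per-domain exceptions.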
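For the classification step, a rough sketch of a single-document scan is below. The endpoint URL, header name, environment variable, and payload fields are assumptions that should be checked against the current Originality.ai API documentation; this is a placeholder, not a verified client.

```python
# Sketch only: the endpoint, header name, and payload fields below are
# assumptions and must be verified against the Originality.ai API docs.
import os
import requests

API_KEY = os.environ["ORIGINALITY_API_KEY"]  # assumed env var name
SCAN_ENDPOINT = "https://api.originality.ai/api/v1/scan/ai"  # placeholder endpoint

def classify_text(text: str) -> dict:
    """Return the raw scan result (AI/original scores) for one document."""
    resp = requests.post(
        SCAN_ENDPOINT,
        headers={"X-OAI-API-KEY": API_KEY},  # assumed header name
        json={"content": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()
```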
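A minimal sketch of the final probability computation, assuming the deduplicated URL pool has been collected into a DataFrame with the `cited_count` column from above plus a boolean classification column (named `is_ai_generated` here purely for illustration).

```python
# Sketch: compute P(cited | AI-generated) and P(cited | human-written) from a
# labeled URL pool. Column names other than cited_count are assumptions.
import pandas as pd

def citation_probabilities(urls: pd.DataFrame) -> dict:
    cited = urls["cited_count"] > 0
    ai = urls["is_ai_generated"]  # assumed boolean classification column

    p_cited_given_ai = cited[ai].mean()      # (# cited AI URLs) / (# total AI URLs)
    p_cited_given_human = cited[~ai].mean()  # (# cited human URLs) / (# total human URLs)
    return {
        "P(cited | AI-generated)": p_cited_given_ai,
        "P(cited | human-written)": p_cited_given_human,
    }
```

With these two conditional probabilities in hand, their ratio (or difference) gives the headline comparison between AI-generated and human-written sources.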
| Version | Sample Size | Y (AI Overview shown) | N (no AI Overview) | B (blocked) |
|---|---|---|---|---|
| v1 | 50 | 48% | 18% | 34% |
| N (organic results) | Matches (cited URLs also in top N) | Total cited URLs | Coverage (%) |
|---|---|---|---|
| 10 | 89 | 282 | 31.56% |
| 20 | 117 | 294 | 39.80% |
| 40 | 150 | 282 | 53.19% |
| 80 | 162 | 298 | 54.36% |
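The coverage column above is simply matches divided by total cited URLs, i.e. the share of AI Overview citations that also appear among the top N organic results. A minimal sketch of that calculation, reproducing the table rows:

```python
# Coverage of AI Overview citations by the top-N organic results:
# matches / total_cited, e.g. 150 / 282 ≈ 53.19% for N = 40.
def citation_coverage(matches: int, total_cited: int) -> float:
    return matches / total_cited

for n, matches, total in [(10, 89, 282), (20, 117, 294), (40, 150, 282), (80, 162, 298)]:
    print(f"N={n:>2}: {citation_coverage(matches, total):.2%}")
```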