OriginalityAI/ai-citation-study

OUROBOROS

Research Questions

  • How likely is AI-generated content to be cited in AI Overviews? How does the rate at which AI content is cited compare with the rate for human content (is the correlation positive or negative)?
  • What proportion of sources cited in Google AI Overviews are AI-generated?
  • How is AI/human ratio distributed across different query topics?
  • What queries do not result in an AI summary? Which queries result in "An AI Overview is not available for this search"?
  • Does this impact the quality of overviews?
  • What are the implications for SEO?

Timeline

  • Week 1 (June 9+): Plan the study and outline the data collection pipeline
  • Week 2 (June 16+): Refined core methodology to measure P(cited | AI) vs P(cited | human). Ran citation coverage analysis across N = 10 to 80 and selected N = 40 as optimal.
  • Week 3 (June 23+): Scale up sample size (v2) and classify documents using Originality.ai. Compute preliminary citation probabilities for AI vs human content.
  • Week 4 (June 30+): Finalize main results (P(cited | AI) vs P(cited | human)), generate supporting tables and plots, and begin drafting key insights.
  • Weeks 5–7 (July 7+): Explore secondary questions (e.g., domain-level trends, content depth), refine plots and findings, and begin outlining blog post.
  • Week 8 (July 28+): Draft the blog post and prepare final graphics and visualizations for publication.
  • Week 9 (August 4+): Optional: package technical findings into a short paper or internal report.

Project Setup

  1. Download the MS MARCO queries dataset (9.2M real Bing queries) and save it as /datasets/ms-marco-web-search-queries.tsv.
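A minimal sketch of drawing a reproducible random sample (e.g., a v1_50-style sample) from the downloaded TSV. The `qid`/`query` column names and the in-memory stand-in below are assumptions for illustration; the real input is the multi-million-row TSV above:

```python
import csv
import io
import random

# Hypothetical miniature stand-in for /datasets/ms-marco-web-search-queries.tsv;
# the real file holds millions of Bing queries, one per row.
raw_tsv = "qid\tquery\n" + "\n".join(f"{i}\thow to do thing {i}" for i in range(100))

def sample_queries(tsv_text: str, n: int, seed: int = 0) -> list:
    """Randomly sample n query rows from a TSV with a header line."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    rng = random.Random(seed)  # fixed seed so a given sample is reproducible
    return rng.sample(rows, n)

sample = sample_queries(raw_tsv, 50)
print(len(sample))  # 50 sampled query rows
```

Fixing the seed matters here: the sampled query set defines the whole downstream URL pool, so it should be regenerable exactly.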

Project Structure

  • /datasets - all source datasets.
  • /samples - contains sampled queries from MS MARCO dataset.
    • e.g., a folder named v1_50 means the v1 filter was used (it filters out queries that are unlikely to trigger AI Overviews) and 50 queries were randomly sampled
    • Files prefixed with queries_ contain queries and their unique IDs
    • Query files postfixed with _labeled contain an additional triggered_ai_overview column:
      • y: query triggers an AI overview
      • n: no AI overview
      • b: Google attempted to show an AI Overview but blocked it by policy ("An AI Overview is not available for this search" is displayed), e.g., for the query why liberals hate america.
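For illustration, counting the label distribution in a queries_*_labeled file might look like this (the excerpt below is made up; only the triggered_ai_overview column and its y/n/b values come from the repo):

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a queries_*_labeled TSV; the triggered_ai_overview
# column uses y / n / b as described above.
labeled_tsv = """qid\tquery\ttriggered_ai_overview
1\thow do solar panels work\ty
2\tfacebook login\tn
3\twhy liberals hate america\tb
4\twhat is model collapse\ty
"""

rows = list(csv.DictReader(io.StringIO(labeled_tsv), delimiter="\t"))
counts = Counter(row["triggered_ai_overview"] for row in rows)
print(counts)  # e.g. Counter({'y': 2, 'n': 1, 'b': 1})
```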

Related Work

Model collapse when training on AI data:

Data Collection

Online Query Datasets

| Dataset | # Queries | Recency | Source Type | Notes |
| --- | --- | --- | --- | --- |
| MS MARCO Web Search | ~10 million | ~2024 | Human (Bing search logs) | Real-world queries; main dataset |
| ORCAS | ~10 million | ~2020 | Human (click logs) | Includes query-document pairs with user click signals |
| Natural Questions | ~320,000 | ~2019 | Human (Google QA queries) | QA-focused dataset with gold answers |

We selected MS MARCO Web Search as our primary dataset because:

  • Large, diverse set of real user queries from Bing
  • Recency (2024), reflecting modern search behavior
  • Representative of typical user searches, covering a wide range of query types
  • Well documented and formatted

Data Collection Pipeline

  1. Sample Queries
    Generate or select a large set of queries predicted to trigger AI Overviews using the WTAO filter.

  2. Run Queries & Collect Responses
    For each query, retrieve:

    • AI Overview response with all cited URLs
    • Top N organic search result URLs (configurable N, e.g., 10 or 20)

  3. Combine & Deduplicate URLs
    Merge URLs from both organic results and AI Overview citations.
    Normalize URLs to avoid duplicates (e.g., remove query parameters, use consistent casing).
    Deduplicate to create a master pool of unique URLs.

  4. Label URLs with Citation & Organic Counts
    For each URL, track:

    • cited_count: number of times cited by AI Overviews across all queries
    • in_organic_results_count: number of times the URL appears in organic results

  5. Classify URLs as AI-generated or Human-written
    Use the Originality.ai Batch Scan API to classify each URL's content.
    Store classification results, including confidence scores and labels.

  6. Calculate Citation Probabilities and Analyze
    Compute conditional probabilities:

    • P(cited | AI-generated) = (# cited AI URLs) / (# total AI URLs)
    • P(cited | human-written) = (# cited human URLs) / (# total human URLs)

    Analyze citation frequency distributions, overlap ratios, and trends over time or by query category.
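The URL normalization in step 3 and the conditional probabilities in step 6 can be sketched as follows. The pool contents and helper names are hypothetical; only the P(cited | …) definitions come from the pipeline above:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Drop the query string and fragment and lowercase the host (step 3)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path, "", ""))

# Hypothetical master pool after classification: url -> (label, cited_count).
pool = {
    "https://example.com/a": ("ai", 2),
    "https://example.com/b": ("ai", 0),
    "https://example.org/c": ("human", 1),
    "https://example.org/d": ("human", 1),
    "https://example.org/e": ("human", 0),
}

def p_cited(label: str) -> float:
    """P(cited | label) = (# cited URLs with label) / (# total URLs with label)."""
    urls = [cited for lbl, cited in pool.values() if lbl == label]
    return sum(1 for cited in urls if cited > 0) / len(urls)

print(normalize_url("HTTPS://Example.com/a?utm_source=x#top"))  # https://example.com/a
print(p_cited("ai"), p_cited("human"))  # 0.5 0.666...
```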

WTAO Filter Stats

| Version | Sample Size | Y | N | B |
| --- | --- | --- | --- | --- |
| v1 | 50 | 48% | 18% | 34% |

N organic results citation presence stats (v1_50 sample)

| N (organic results) | Matches | Total Cited | Proportion (%) |
| --- | --- | --- | --- |
| 10 | 89 | 282 | 31.56% |
| 20 | 117 | 294 | 39.80% |
| 40 | 150 | 282 | 53.19% |
| 80 | 162 | 298 | 54.36% |
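The Proportion column is simply Matches / Total Cited; a quick sanity check of the table values:

```python
# (N, matches, total_cited) rows from the table above.
rows = [(10, 89, 282), (20, 117, 294), (40, 150, 282), (80, 162, 298)]

proportions = {n: round(100 * matches / total, 2) for n, matches, total in rows}
print(proportions)  # {10: 31.56, 20: 39.8, 40: 53.19, 80: 54.36}
```

The coverage gain flattens sharply between N = 40 and N = 80 (53.19% vs 54.36%), which is consistent with the Week 2 choice of N = 40 as optimal.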
