OriginalityAI/ai-citation-study

OUROBOROS

Research Questions

  • How likely is AI-generated content to be cited in AI Overviews? How does the rate at which AI content is cited compare with the rate for human content (is the correlation positive or negative)?
  • What proportion of sources cited in Google AI Overviews are AI-generated?
  • How is AI/human ratio distributed across different query topics?
  • What queries do not result in an AI summary? Which queries result in "An AI Overview is not available for this search"?
  • Does this impact the quality of overviews?
  • What are the implications for SEO?

Timeline

  • Week 1 (June 9+): Plan the study and outline the data collection pipeline
  • Week 2 (June 16+): Refined core methodology to measure P(cited | AI) vs P(cited | human). Ran citation coverage analysis across N = 10 to 80 and selected N = 40 as optimal.
  • Week 3 (June 23+): Scale up sample size (v2) and classify documents using Originality.ai. Compute preliminary citation probabilities for AI vs human content.
  • Week 4 (June 30+): Finalize main results (P(cited | AI) vs P(cited | human)), generate supporting tables and plots, and begin drafting key insights.
  • Weeks 5–7 (July 7+): Explore secondary questions (e.g., domain-level trends, content depth), refine plots and findings, and begin outlining blog post.
  • Week 8 (July 28+): Draft the blog post and prepare final graphics and visualizations for publication.
  • Week 9 (August 4+): Optional: package technical findings into a short paper or internal report.

Project Setup

  1. Download the MS MARCO queries dataset (9.2M real Bing queries) and save it as /datasets/ms-marco-web-search-queries.tsv.
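A minimal sketch of drawing a reproducible random sample (e.g., a v1_50-style sample) from the downloaded TSV. The `qid`/`query` column names and the in-memory stand-in below are assumptions for illustration; the real input is the multi-million-row TSV above:

```python
import csv
import io
import random

# Hypothetical miniature stand-in for /datasets/ms-marco-web-search-queries.tsv;
# the real file holds millions of Bing queries, one per row.
raw_tsv = "qid\tquery\n" + "\n".join(f"{i}\thow to do thing {i}" for i in range(100))

def sample_queries(tsv_text: str, n: int, seed: int = 0) -> list:
    """Randomly sample n query rows from a TSV with a header line."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    rng = random.Random(seed)  # fixed seed so a given sample is reproducible
    return rng.sample(rows, n)

sample = sample_queries(raw_tsv, 50)
print(len(sample))  # 50 sampled query rows
```

Fixing the seed matters here: the sampled query set defines the whole downstream URL pool, so it should be regenerable exactly.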

Project Structure

  • /datasets - all source datasets.
  • /samples - contains sampled queries from MS MARCO dataset.
    • e.g., a folder named v1_50 means the v1 filter was used (it filters out queries that are unlikely to trigger AI Overviews) and 50 queries were randomly sampled
    • Files prefixed with queries_ contain queries and their unique IDs
    • Query files postfixed with _labeled contain an additional triggered_ai_overview column:
      • y: query triggers an AI overview
      • n: no AI overview
      • b: Google attempted to show an AI Overview but blocked it by policy ("An AI Overview is not available for this search" is displayed), e.g., for the query why liberals hate america.
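For illustration, counting the label distribution in a queries_*_labeled file might look like this (the excerpt below is made up; only the triggered_ai_overview column and its y/n/b values come from the repo):

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a queries_*_labeled TSV; the triggered_ai_overview
# column uses y / n / b as described above.
labeled_tsv = """qid\tquery\ttriggered_ai_overview
1\thow do solar panels work\ty
2\tfacebook login\tn
3\twhy liberals hate america\tb
4\twhat is model collapse\ty
"""

rows = list(csv.DictReader(io.StringIO(labeled_tsv), delimiter="\t"))
counts = Counter(row["triggered_ai_overview"] for row in rows)
print(counts)  # e.g. Counter({'y': 2, 'n': 1, 'b': 1})
```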

Related Work

Model collapse when training on AI data:

Data Collection

Online Query Datasets

| Dataset | # Queries | Recency | Source Type | Notes |
| --- | --- | --- | --- | --- |
| MS MARCO Web Search | ~10 million | ~2024 | Human (Bing search logs) | Real-world queries; main dataset |
| ORCAS | ~10 million | ~2020 | Human (click logs) | Includes query-document pairs with user click signals |
| Natural Questions | ~320,000 | ~2019 | Human (Google QA queries) | QA-focused dataset with gold answers |

We selected MS MARCO Web Search as our primary dataset because:

  • Large, diverse set of real user queries from Bing
  • Recency (2024), reflecting modern search behavior
  • Representative of typical user searches, covering a wide range of query types
  • Well documented and formatted

Data Collection Pipeline

  1. Sample Queries
    Generate or select a large set of queries predicted to trigger AI Overviews using the WTAO filter.

  2. Run Queries & Collect Responses
    For each query, retrieve:

    • AI Overview response with all cited URLs
    • Top N organic search result URLs (configurable N, e.g., 10 or 20)

  3. Combine & Deduplicate URLs
    Merge URLs from both organic results and AI Overview citations.
    Normalize URLs to avoid duplicates (e.g., remove query parameters, use consistent casing).
    Deduplicate to create a master pool of unique URLs.

  4. Label URLs with Citation & Organic Counts
    For each URL, track:

    • cited_count: number of times cited by AI Overviews across all queries
    • in_organic_results_count: number of times the URL appears in organic results

  5. Classify URLs as AI-generated or Human-written
    Use the Originality.ai Batch Scan API to classify each URL's content.
    Store classification results, including confidence scores and labels.

  6. Calculate Citation Probabilities and Analyze
    Compute conditional probabilities:

    • P(cited | AI-generated) = (# cited AI URLs) / (# total AI URLs)
    • P(cited | human-written) = (# cited human URLs) / (# total human URLs)

    Analyze citation frequency distributions, overlap ratios, and trends over time or by query category.
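The URL normalization in step 3 and the conditional probabilities in step 6 can be sketched as follows. The pool contents and helper names are hypothetical; only the P(cited | …) definitions come from the pipeline above:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Drop the query string and fragment and lowercase the host (step 3)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path, "", ""))

# Hypothetical master pool after classification: url -> (label, cited_count).
pool = {
    "https://example.com/a": ("ai", 2),
    "https://example.com/b": ("ai", 0),
    "https://example.org/c": ("human", 1),
    "https://example.org/d": ("human", 1),
    "https://example.org/e": ("human", 0),
}

def p_cited(label: str) -> float:
    """P(cited | label) = (# cited URLs with label) / (# total URLs with label)."""
    urls = [cited for lbl, cited in pool.values() if lbl == label]
    return sum(1 for cited in urls if cited > 0) / len(urls)

print(normalize_url("HTTPS://Example.com/a?utm_source=x#top"))  # https://example.com/a
print(p_cited("ai"), p_cited("human"))  # 0.5 0.666...
```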

WTAO Filter Stats

| Version | Sample Size | Y | N | B |
| --- | --- | --- | --- | --- |
| v1 | 50 | 48% | 18% | 34% |

N organic results citation presence stats (v1_50 sample)

| N (organic results) | Matches | Total Cited | Proportion (%) |
| --- | --- | --- | --- |
| 10 | 89 | 282 | 31.56% |
| 20 | 117 | 294 | 39.80% |
| 40 | 150 | 282 | 53.19% |
| 80 | 162 | 298 | 54.36% |
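The Proportion column is simply Matches / Total Cited; a quick sanity check of the table values:

```python
# (N, matches, total_cited) rows from the table above.
rows = [(10, 89, 282), (20, 117, 294), (40, 150, 282), (80, 162, 298)]

proportions = {n: round(100 * matches / total, 2) for n, matches, total in rows}
print(proportions)  # {10: 31.56, 20: 39.8, 40: 53.19, 80: 54.36}
```

The coverage gain flattens sharply between N = 40 and N = 80 (53.19% vs 54.36%), which is consistent with the Week 2 choice of N = 40 as optimal.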
