ElixirDatasets


ElixirDatasets is a library for accessing and managing datasets from the Hugging Face Hub in Elixir. Inspired by the Python datasets library, it brings dataset loading, caching, streaming, and upload to the Elixir ecosystem and returns data as Explorer DataFrames.

✨ Features

  • 🚀 Easy Access to Hugging Face Hub - Load thousands of datasets with a single function call
  • 📊 Explorer Integration - Automatic conversion to Explorer DataFrames for data manipulation
  • 💾 Smart Caching - Intelligent local caching to avoid redundant downloads
  • 🌊 Streaming Support - Process large datasets without loading everything into memory
  • 📤 Upload Datasets - Publish your own datasets to Hugging Face Hub
  • 🔒 Private Repositories - Full support for authentication and private datasets
  • 🎯 Multiple Formats - Support for CSV, Parquet, and JSONL files

📦 Installation

Add elixir_datasets to your list of dependencies in mix.exs:

def deps do
  [
    {:elixir_datasets, "~> 0.1.0"}
  ]
end

🚀 Quick Start

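# Load the "train" split of a dataset from the Hugging Face Hub
{:ok, [train_df]} = ElixirDatasets.load_dataset(
  {:hf, "cornell-movie-review-data/rotten_tomatoes"},
  split: "train"
)

# Load datasets from a local directory
{:ok, datasets} = ElixirDatasets.load_dataset({:local, "./data"})

# Stream a split instead of loading it into memory all at once
{:ok, stream} = ElixirDatasets.load_dataset(
  {:hf, "stanfordnlp/imdb", subdir: "plain_text"},
  split: "train",
  streaming: true
)

# Pull only the first 100 elements from the stream
stream |> Enum.take(100) |> IO.inspect()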
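
A streamed dataset can be consumed lazily with the standard Stream and Enum functions. The sketch below uses a stand-in stream instead of a real download, and the row shape (a map with "text" and "label" keys) is an assumption made for illustration, not the documented output of load_dataset:

```elixir
# Stand-in for the stream returned by load_dataset with `streaming: true`.
# The row shape (a map with "text" and "label") is hypothetical.
rows =
  Stream.map(1..1_000_000, fn i ->
    %{"text" => "review #{i}", "label" => rem(i, 2)}
  end)

# Nothing is computed until Enum.take/2 pulls elements through the pipeline,
# so only as many rows as the two batches need are generated.
batches =
  rows
  |> Stream.filter(&(&1["label"] == 1))
  |> Stream.chunk_every(3)
  |> Enum.take(2)

IO.inspect(batches)
```

Keeping every step before the final Enum call inside Stream is what avoids materializing the full dataset.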

📚 Examples

All examples can be found in the examples directory.

  • examples/usage_examples.livemd - Comprehensive usage examples of the elixir_datasets API
  • examples/integration_examples.livemd - Examples demonstrating integration with other Elixir libraries like Nx, Axon, and Bumblebee

🔧 Configuration

Environment Variables

  • ELIXIR_DATASETS_CACHE_DIR - Custom cache directory
  • ELIXIR_DATASETS_OFFLINE - Enable offline mode ("1" or "true")
  • HF_TOKEN - Authentication token for private datasets
  • [🚧 In-progress] HF_DEBUG - Enable debug logging ("1" or "true")
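
These variables are usually exported in the shell, but they can also be set from Elixir before the first call into the library. The following is a minimal sketch that assumes the library reads the variables at call time rather than at compile time; the cache path is a hypothetical example:

```elixir
# Hypothetical cache location; any writable directory would do.
cache_dir = Path.join(System.tmp_dir!(), "elixir_datasets_cache")

# Assumption: ElixirDatasets reads these when a dataset is loaded.
System.put_env("ELIXIR_DATASETS_CACHE_DIR", cache_dir)
System.put_env("ELIXIR_DATASETS_OFFLINE", "1")

# HF_TOKEN is a secret, so prefer exporting it in the shell over hard-coding it.
token_set? = System.get_env("HF_TOKEN") != nil

IO.puts("cache dir: #{System.get_env("ELIXIR_DATASETS_CACHE_DIR")}")
IO.puts("offline:   #{System.get_env("ELIXIR_DATASETS_OFFLINE")}")
IO.puts("HF_TOKEN set? #{token_set?}")
```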

📖 Documentation

Full documentation is available at HexDocs. A separate build hosted on GitHub Pages reflects the current state of features still under development. To generate the documentation locally, run:

mix docs

🧪 Testing

MIX_ENV=test mix test

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Copyright (c) 2025 Radosław Rolka, Weronika Wojtas
