
EcoNetToolkit — simple models for ecological data

EcoNetToolkit lets you train a shallow neural network or classical models on your tabular ecological data using a simple YAML file.

  • CSV input with automatic preprocessing (impute, scale, encode)
  • Model zoo: MLP (shallow), Random Forest, SVM, XGBoost, Logistic Regression, Linear Regression
  • Hyperparameter tuning with grouped train/val/test splits (prevents data leakage)
  • Repeated training with different seeds for stable estimates
  • Metrics suited to imbalanced datasets (balanced accuracy, PR AUC)
  • K-fold cross-validation with spatial/temporal grouping
  • Configure the project from a single config file

Getting Started

macOS and Linux (Terminal)

For these steps, open a new terminal and enter the commands at the prompt.

  1. Clone the repository and move into the directory:

    git clone https://github.com/ysims/EcoNetToolkit.git
    cd EcoNetToolkit
  2. In your terminal, run:

    python3 -m venv .venv
    source .venv/bin/activate
    python -m pip install --upgrade pip
    pip install -r requirements.txt

    To leave the venv later, run deactivate.

If you have already followed these steps during a previous session, reactivate the virtual environment by opening a terminal in the EcoNetToolkit directory and running:

source .venv/bin/activate

Windows (Anaconda)

Install Anaconda (Windows 64‑bit) from the official website, using the default settings. After installation, open 'Anaconda Prompt' from the Start Menu. In the prompt, complete the following steps.

  1. Get Git (this may already be installed):

    conda install git
  2. Clone the repository and move into the directory:

    git clone https://github.com/ysims/EcoNetToolkit.git
    cd EcoNetToolkit
  3. Create the conda environment and activate it:

    conda env create -f environment.yml
    conda activate econet

If the conda command isn’t recognised, make sure you’re in the Anaconda Prompt.

If you have already followed these steps during a previous session, reactivate the conda environment by opening an Anaconda Prompt in the EcoNetToolkit directory and running:

conda activate econet

Configure and Run

All commands should be run in the terminal (macOS and Linux) or the Anaconda Prompt (Windows).

EcoNetToolkit includes two example datasets to help you get started.

Classification Example: Palmer Penguins

Predict penguin species from morphological measurements:

python run.py --config configs/penguins_config.yaml

This demonstrates multi-class classification (3 species: Adelie, Chinstrap, Gentoo) using features like bill length, flipper length, and body mass.

Regression Example: Possum Morphology

Predict possum age from morphological measurements:

python run.py --config configs/possum_config.yaml

This demonstrates continuous variable prediction using head length, skull width, and other physical measurements.

Outputs

Outputs are organised into folders based on your config file name. For example, running configs/possum_config.yaml creates:

outputs/
└── possum_config/                    # Named after your config file
    ├── random_forest/                # Model-specific subfolder
    │   ├── model_random_forest_seed42.joblib
    │   ├── model_random_forest_seed43.joblib
    │   ├── ...
    │   ├── report_random_forest.json
    │   ├── confusion_matrix_random_forest.png   (classification only)
    │   ├── pr_curve_random_forest.png           (classification only)
    │   └── residual_plot_random_forest.png      (regression only)
    ├── xgboost/
    ├── mlp/
    ├── svm/
    ├── linear/
    ├── report_all_models.json        # Combined results across all models
    ├── comparison_mse.png             # Comparison plots (regression)
    ├── comparison_r2.png
    ├── comparison_accuracy.png        # Comparison plots (classification)
    ├── comparison_f1.png
    └── pr_curve_comparison.png        # Combined PR curves (classification)

Model-specific outputs (in each model subfolder):

  • model_<name>_seed<N>.joblib: trained models for each random seed
  • report_<model>.json: per-seed metrics (MSE, R², accuracy, F1, etc.)
  • confusion_matrix_<model>.png: confusion matrix heatmap (classification)
  • pr_curve_<model>.png: precision-recall curve (classification)
  • residual_plot_<model>.png: predicted vs actual and residuals (regression)

Multi-model comparison outputs (in the config folder root):

  • report_all_models.json: combined metrics across all models and seeds
  • comparison_*.png: side-by-side boxplots comparing model performance
  • pr_curve_comparison.png: overlaid precision-recall curves (classification)
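
The JSON reports are plain files, so you can post-process results programmatically. A minimal sketch (the exact schema isn't documented here, so inspect the keys before relying on them):

import json
from pathlib import Path

# Hypothetical path; substitute your own config name.
report_path = Path('outputs/possum_config/report_all_models.json')
with report_path.open() as f:
    report = json.load(f)

# Check the structure before drilling into specific keys.
print(type(report).__name__)
print(list(report)[:5] if isinstance(report, dict) else report[:2])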

Inspecting Saved Models

python inspect_models.py outputs/possum_config/random_forest/model_*.joblib

This shows model type, parameters, training iterations, and other metadata. The .joblib files contain serialised scikit-learn models that you can load and use for predictions:

import joblib
model = joblib.load('outputs/possum_config/random_forest/model_random_forest_seed42.joblib')
predictions = model.predict(X_new)  # X_new must be preprocessed the same way

Config reference (YAML)

You can train single or multiple models for comparison. See configs/penguins_config.yaml for comprehensive examples of all model types and their parameters.

Simple example (single model, classification)

problem_type: classification

data:
  path: data/sample.csv
  features: [f1, f2, habitat]
  label: label
  test_size: 0.2
  val_size: 0.2
  random_state: 0
  scaling: standard
  impute_strategy: mean

models:
  - name: mlp
    params:
      hidden_layer_sizes: [32, 16]
      max_iter: 300
      early_stopping: true

training:
  repetitions: 5
  random_seed: 0

# Optional: specify output directory (defaults to outputs/<config_name>/)
output:
  dir: outputs/my_experiment

Note: If output.dir is not specified, outputs are automatically saved to outputs/<config_name>/ where <config_name> is derived from your config file name.

Multi-output (multi-target) prediction

EcoNetToolkit supports predicting multiple target variables simultaneously (multi-output learning). This is useful when you want to predict several related outcomes from the same features.

Example: Multi-output classification

problem_type: classification

data:
  path: data/palmerpenguins_extended.csv
  features: [bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, island]
  labels: [species, sex, life_stage]  # Predict 3 labels simultaneously
  test_size: 0.2
  val_size: 0.15
  scaling: standard

models:
  - name: random_forest
    params:
      n_estimators: 100
      max_depth: 15

training:
  repetitions: 5

Example: Multi-output regression

problem_type: regression

data:
  path: data/possum.csv
  features: [hdlngth, skullw, totlngth, taill]
  labels: [age, chest, belly]  # Predict 3 continuous values
  test_size: 0.2
  scaling: standard

models:
  - name: mlp
    params:
      hidden_layer_sizes: [32, 16]
      max_iter: 500

Key points:

  • Use labels: (list) instead of label: (single string) to specify multiple targets
  • For backward compatibility, label: still works for single-output prediction
  • Multi-output metrics report mean and standard deviation across all outputs
  • Some models support multi-output natively (Random Forest, MLP Regressor), others are wrapped automatically (Logistic Regression, SVM, Linear Regression)

See configs/penguins_multilabel.yaml and configs/possum_multilabel.yaml for complete examples.
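
Under the hood, "wrapped automatically" corresponds to scikit-learn's standard multi-output pattern. A minimal sketch of that pattern (illustrative only, not necessarily the toolkit's exact internals):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

X, y1 = make_classification(n_samples=200, n_features=5, random_state=0)
y2 = (X[:, 0] > 0).astype(int)  # a second, related target
Y = np.column_stack([y1, y2])   # shape (n_samples, n_outputs)

# LogisticRegression is single-output; MultiOutputClassifier fits one
# estimator per target column, scikit-learn's standard wrapper.
clf = MultiOutputClassifier(LogisticRegression(max_iter=200)).fit(X, Y)
print(clf.predict(X[:3]))       # one column of predictions per target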

Available models and key parameters

MLP (Multi-Layer Perceptron)

  • hidden_layer_sizes: List of layer sizes, e.g., [32, 16]
  • max_iter: Maximum iterations
  • early_stopping: Stop when validation plateaus
  • n_iter_no_change: Patience (number of epochs to wait without improvement; default: 10)
  • validation_fraction: Fraction of training data for validation
  • alpha: L2 regularisation
  • learning_rate_init: Initial learning rate
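
These parameter names match scikit-learn's MLPClassifier/MLPRegressor keyword arguments, which suggests the YAML params are passed straight through. A sketch of the equivalent constructor call (the direct mapping is an assumption, shown for orientation; the same idea applies to the other models below):

from sklearn.neural_network import MLPClassifier

# Each YAML param above corresponds to a keyword argument here.
mlp = MLPClassifier(
    hidden_layer_sizes=(32, 16),
    max_iter=300,
    early_stopping=True,
    n_iter_no_change=10,
    validation_fraction=0.1,
    alpha=0.0001,
    learning_rate_init=0.001,
)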

Random Forest

  • n_estimators: Number of trees
  • max_depth: Max tree depth (null = unlimited)
  • min_samples_split: Min samples to split
  • max_features: Features per split (sqrt, log2, or null)

SVM (Support Vector Machine)

  • C: Regularisation parameter
  • kernel: rbf, linear, poly, or sigmoid
  • gamma: Kernel coefficient (scale or auto)

XGBoost

  • n_estimators: Boosting rounds
  • max_depth: Max tree depth
  • learning_rate: Step size (eta)
  • subsample: Training instance ratio
  • colsample_bytree: Feature ratio

Logistic Regression (classification only)

  • C: Inverse regularisation strength
  • max_iter: Max solver iterations
  • solver: lbfgs, liblinear, newton-cg, etc.
  • penalty: l1, l2, elasticnet, or null

Linear Regression (regression only)

  • fit_intercept: Whether to calculate the intercept (default: true)

Notes on metrics

Classification:

  • Primary ranking metric: Cohen's kappa (accounts for chance agreement, robust for imbalanced data)
  • Also reported: accuracy, balanced accuracy, precision, recall, F1, ROC-AUC, PR-AUC

Regression:

  • Primary ranking metric: MSE (Mean Squared Error, lower is better)
  • Also reported: RMSE, MAE, R², MAPE

Additional notes

  • For classification with two classes, ROC-AUC and PR-AUC are computed if the model can produce probabilities (e.g., MLP, RandomForest, SVM with probability=True).
  • For multi-class problems, macro-averaged Precision/Recall/F1 summarise performance across all classes.
  • Models are ranked by Cohen's kappa (classification) or MSE (regression) to identify the best performer.
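
These metrics map directly onto scikit-learn functions, so you can reproduce them on your own predictions. A small illustration:

from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score, f1_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

print(cohen_kappa_score(y_true, y_pred))          # chance-corrected agreement
print(balanced_accuracy_score(y_true, y_pred))    # mean per-class recall
print(f1_score(y_true, y_pred, average='macro'))  # macro-averaged F1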

Hyperparameter Tuning

EcoNetToolkit includes automated hyperparameter tuning with proper train/validation/test splits to prevent data leakage. This is especially important for ecological data with spatial or temporal structure.

Quick Example:

python run.py --config configs/mangrove_tuning.yaml

Key Features:

  • Grouped splits: Assign groups (e.g., patches, sites, years) to train/val/test sets
  • Automatic search: GridSearchCV or RandomizedSearchCV to find optimal hyperparameters
  • Multiple seeds: Run with different random seeds for stable results
  • Proper evaluation: Tune on train+val, evaluate on held-out test set
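
The grouped-split idea is the same one behind scikit-learn's group-aware splitters: every row from a group stays on one side of the split. A minimal sketch using GroupShuffleSplit (an illustration of the principle, not necessarily the toolkit's internals):

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.random((100, 3))
groups = np.repeat(np.arange(8), 13)[:100]  # e.g. samples from 8 spatial patches

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, groups=groups))

# Every patch lands entirely in train or in test, so spatially
# correlated rows never leak across the split.
assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))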

Example Config:

problem_type: regression

data:
  path: data/mangrove.csv
  cv_group_column: patch_id    # Group by spatial patches
  n_train_groups: 4            # 4 patches for training
  n_val_groups: 2              # 2 patches for validation (tuning)
  n_test_groups: 2             # 2 patches for test (final eval)
  
  labels: [NDVI]
  features: [pu_x, pu_y, temperature, ...]
  scaling: standard

# Enable hyperparameter tuning
tuning:
  enabled: true
  search_method: random       # "random" or "grid"
  n_iter: 30                  # Number of parameter combinations
  cv_folds: 3                 # CV folds during tuning
  scoring: neg_mean_squared_error
  n_jobs: -1                  # Use all CPU cores

# Define models and search spaces
models:
  - name: random_forest
    param_space:
      n_estimators: [100, 200, 500, 1000]
      max_depth: [10, 20, 30, 50]
      min_samples_split: [2, 5, 10]
      max_features: [sqrt, log2, 0.5]

training:
  repetitions: 5
  random_seed: 42
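
The tuning settings above map onto scikit-learn's RandomizedSearchCV. A rough stand-alone equivalent of what this config requests (illustrative; the toolkit layers grouped splits and repeated seeds on top of this):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=6, random_state=42)

param_space = {
    'n_estimators': [100, 200, 500, 1000],
    'max_depth': [10, 20, 30, 50],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2', 0.5],
}

# Sample 30 parameter combinations and score each with 3-fold CV on all cores.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_space,
    n_iter=30,
    cv=3,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)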

Outputs include:

  • Best hyperparameters for each seed
  • Validation and test set performance
  • Comparison plots
  • Trained models with optimal parameters

For detailed information, see docs/HYPERPARAMETER_TUNING.md

Using your own data

  1. Place your CSV file in the data folder.

  2. Make a YAML config file in the configs folder for your data.

    Use one of the existing config files (penguins for classification; possum for regression) as a basis. Change the CSV path to point to your file, and update the features and label parameters to match the columns in your CSV. The parameters for each model should then be tuned to your problem.

    If you are unsure how to write the YAML file, try providing ChatGPT (or your favourite LLM) with your CSV file (or its first few rows) and a link to this repository, and ask it to generate a config file for your data. Consider data privacy before doing this.

Some tips:

  • Ensure your features: list includes only columns available in your CSV.
  • Text categories are automatically one-hot encoded.
  • If your dataset is very imbalanced, consider class_weight: balanced in model.params for logistic regression or SVM, or tune scale_pos_weight for xgboost (a quick way to compute it is sketched below).
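
For the xgboost tip, scale_pos_weight is conventionally set to the ratio of negative to positive samples. A quick way to compute it (hypothetical file and column names):

import pandas as pd

df = pd.read_csv('data/my_data.csv')       # hypothetical dataset
counts = df['label'].value_counts()        # hypothetical binary label (0/1)
scale_pos_weight = counts[0] / counts[1]   # negatives / positives
print(f'scale_pos_weight: {scale_pos_weight:.2f}')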

Testing

Testing is provided for development purposes and is used by the CI system when pull requests are created.

Unit Tests

Run the test suite to ensure everything works correctly:

python run_tests.py all -v

Or run with coverage:

python run_tests.py all -v --cov=ecosci --cov-report=html

The tests verify:

  • Data loading and preprocessing (scaling, encoding, splits)
  • Model instantiation and training for all model types
  • Metric computation produces sane values (0-1 ranges, no NaN/inf)
  • Full end-to-end pipeline runs without errors
  • Models produce reasonable accuracy (better than random)

End-to-End Testing

Test the full pipeline with the included example datasets:

Classification (Penguins):

python run.py --config configs/penguins_config.yaml

Regression (Possum):

python run.py --config configs/possum_config.yaml

These demonstrate that the toolkit works correctly for both problem types and generates appropriate metrics and visualisations.

Troubleshooting

  • Shapes or column errors: double-check your features: and label: names.
  • No probabilities for some models: not all models support predict_proba; plots that need probabilities are skipped automatically.

Development layout

  • run.py — simple entrypoint script
  • ecosci/ — package with modules:
    • config.py (YAML reader)
    • data.py (CSV loader + preprocessing)
    • models.py (ModelZoo)
    • trainer.py (seeded training loop, saving models)
    • eval.py (metrics and plots)
  • configs/ — example configurations
  • data/ — sample CSVs for quick testing

About

A toolkit for ecology-focused artificial neural networks, to compare against classical methods.
