
EcoNetToolkit — simple models for ecological data

EcoNetToolkit lets you train a shallow neural network or classical models on your tabular ecological data using a simple YAML file.

  • CSV input with automatic preprocessing (impute, scale, encode)
  • Model zoo: MLP (shallow), Random Forest, SVM, XGBoost, Logistic Regression, Linear Regression
  • Hyperparameter tuning with grouped train/val/test splits (prevents data leakage)
  • Repeated training with different seeds for stable estimates
  • Metrics suited to imbalanced datasets (balanced accuracy, PR AUC)
  • K-fold cross-validation with spatial/temporal grouping
  • Configure the project from a single config file

Getting Started

macOS and Linux (Terminal)

For these steps, open a new terminal and enter the commands at the prompt.

  1. Clone the repository and move into the directory:

    git clone https://github.com/ysims/EcoNetToolkit.git
    cd EcoNetToolkit
  2. In your terminal, run:

    python3 -m venv .venv
    source .venv/bin/activate
    python -m pip install --upgrade pip
    pip install -r requirements.txt

    To leave the venv later, run deactivate.

If you have already followed these steps during a previous session, reactivate the virtual environment by opening a terminal in the EcoNetToolkit directory and running:

source .venv/bin/activate

Windows (Anaconda)

Install Anaconda (Windows 64‑bit) from the official website, using the default settings. After installation, open 'Anaconda Prompt' from the Start Menu. In the prompt, complete the following steps.

  1. Get Git (this may already be installed):

    conda install git
  2. Clone the repository and move into the directory:

    git clone https://github.com/ysims/EcoNetToolkit.git
    cd EcoNetToolkit
  3. Create the conda environment and activate it:

    conda env create -f environment.yml
    conda activate econet

If the conda command isn’t recognised, make sure you’re in the Anaconda Prompt.

If you have already followed these steps during a previous session, reactivate the conda environment by opening an Anaconda Prompt in the EcoNetToolkit directory and running:

conda activate econet

Configure and Run

All commands should be run in the terminal (macOS and Linux) or the Anaconda Prompt (Windows).

EcoNetToolkit includes two example datasets to help you get started.

Classification Example: Palmer Penguins

Predict penguin species from morphological measurements:

python run.py --config configs/penguins_config.yaml

This demonstrates multi-class classification (3 species: Adelie, Chinstrap, Gentoo) using features like bill length, flipper length, and body mass.

Regression Example: Possum Morphology

Predict possum age from morphological measurements:

python run.py --config configs/possum_config.yaml

This demonstrates continuous variable prediction using head length, skull width, and other physical measurements.

Outputs

Outputs are organised into folders based on your config file name. For example, running configs/possum_config.yaml creates:

outputs/
└── possum_config/                    # Named after your config file
    ├── random_forest/                # Model-specific subfolder
    │   ├── model_random_forest_seed42.joblib
    │   ├── model_random_forest_seed43.joblib
    │   ├── ...
    │   ├── report_random_forest.json
    │   ├── confusion_matrix_random_forest.png   (classification only)
    │   ├── pr_curve_random_forest.png           (classification only)
    │   └── residual_plot_random_forest.png      (regression only)
    ├── xgboost/
    ├── mlp/
    ├── svm/
    ├── linear/
    ├── report_all_models.json        # Combined results across all models
    ├── comparison_mse.png             # Comparison plots (regression)
    ├── comparison_r2.png
    ├── comparison_accuracy.png        # Comparison plots (classification)
    ├── comparison_f1.png
    └── pr_curve_comparison.png        # Combined PR curves (classification)

Model-specific outputs (in each model subfolder):

  • model_<name>_seed<N>.joblib: trained models for each random seed
  • report_<model>.json: per-seed metrics (MSE, R², accuracy, F1, etc.)
  • confusion_matrix_<model>.png: confusion matrix heatmap (classification)
  • pr_curve_<model>.png: precision-recall curve (classification)
  • residual_plot_<model>.png: predicted vs actual and residuals (regression)

Multi-model comparison outputs (in the config folder root):

  • report_all_models.json: combined metrics across all models and seeds
  • comparison_*.png: side-by-side boxplots comparing model performance
  • pr_curve_comparison.png: overlaid precision-recall curves (classification)
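
The JSON reports are plain files, so you can post-process results programmatically. A minimal sketch (the exact schema isn't documented here, so inspect the keys before relying on them):

import json
from pathlib import Path

# Hypothetical path; substitute your own config name.
report_path = Path('outputs/possum_config/report_all_models.json')
with report_path.open() as f:
    report = json.load(f)

# Check the structure before drilling into specific keys.
print(type(report).__name__)
print(list(report)[:5] if isinstance(report, dict) else report[:2])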

Inspecting Saved Models

python inspect_models.py outputs/possum_config/random_forest/model_*.joblib

This shows model type, parameters, training iterations, and other metadata. The .joblib files contain serialised scikit-learn models that you can load and use for predictions:

import joblib
model = joblib.load('outputs/possum_config/random_forest/model_random_forest_seed42.joblib')
predictions = model.predict(X_new)  # X_new must be preprocessed the same way

Config reference (YAML)

You can train single or multiple models for comparison. See configs/penguins_config.yaml for comprehensive examples of all model types and their parameters.

Simple example (single model, classification)

problem_type: classification

data:
  path: data/sample.csv
  features: [f1, f2, habitat]
  label: label
  test_size: 0.2
  val_size: 0.2
  random_state: 0
  scaling: standard
  impute_strategy: mean

models:
  - name: mlp
    params:
      hidden_layer_sizes: [32, 16]
      max_iter: 300
      early_stopping: true

training:
  repetitions: 5
  random_seed: 0

# Optional: specify output directory (defaults to outputs/<config_name>/)
output:
  dir: outputs/my_experiment

Note: If output.dir is not specified, outputs are automatically saved to outputs/<config_name>/ where <config_name> is derived from your config file name.

Multi-output (multi-target) prediction

EcoNetToolkit supports predicting multiple target variables simultaneously (multi-output learning). This is useful when you want to predict several related outcomes from the same features.

Example: Multi-output classification

problem_type: classification

data:
  path: data/palmerpenguins_extended.csv
  features: [bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, island]
  labels: [species, sex, life_stage]  # Predict 3 labels simultaneously
  test_size: 0.2
  val_size: 0.15
  scaling: standard

models:
  - name: random_forest
    params:
      n_estimators: 100
      max_depth: 15

training:
  repetitions: 5

Example: Multi-output regression

problem_type: regression

data:
  path: data/possum.csv
  features: [hdlngth, skullw, totlngth, taill]
  labels: [age, chest, belly]  # Predict 3 continuous values
  test_size: 0.2
  scaling: standard

models:
  - name: mlp
    params:
      hidden_layer_sizes: [32, 16]
      max_iter: 500

Key points:

  • Use labels: (list) instead of label: (single string) to specify multiple targets
  • For backward compatibility, label: still works for single-output prediction
  • Multi-output metrics report mean and standard deviation across all outputs
  • Some models support multi-output natively (Random Forest, MLP Regressor), others are wrapped automatically (Logistic Regression, SVM, Linear Regression)

See configs/penguins_multilabel.yaml and configs/possum_multilabel.yaml for complete examples.
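
Under the hood, "wrapped automatically" corresponds to scikit-learn's standard multi-output pattern. A minimal sketch of that pattern (illustrative only, not necessarily the toolkit's exact internals):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

X, y1 = make_classification(n_samples=200, n_features=5, random_state=0)
y2 = (X[:, 0] > 0).astype(int)  # a second, related target
Y = np.column_stack([y1, y2])   # shape (n_samples, n_outputs)

# LogisticRegression is single-output; MultiOutputClassifier fits one
# estimator per target column, scikit-learn's standard wrapper.
clf = MultiOutputClassifier(LogisticRegression(max_iter=200)).fit(X, Y)
print(clf.predict(X[:3]))       # one column of predictions per target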

Available models and key parameters

MLP (Multi-Layer Perceptron)

  • hidden_layer_sizes: List of layer sizes, e.g., [32, 16]
  • max_iter: Maximum iterations
  • early_stopping: Stop when validation plateaus
  • n_iter_no_change: Patience (number of epochs to wait without improvement; default: 10)
  • validation_fraction: Fraction of training data for validation
  • alpha: L2 regularisation
  • learning_rate_init: Initial learning rate
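
These parameter names match scikit-learn's MLPClassifier/MLPRegressor keyword arguments, which suggests the YAML params are passed straight through. A sketch of the equivalent constructor call (the direct mapping is an assumption, shown for orientation; the same idea applies to the other models below):

from sklearn.neural_network import MLPClassifier

# Each YAML param above corresponds to a keyword argument here.
mlp = MLPClassifier(
    hidden_layer_sizes=(32, 16),
    max_iter=300,
    early_stopping=True,
    n_iter_no_change=10,
    validation_fraction=0.1,
    alpha=0.0001,
    learning_rate_init=0.001,
)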

Random Forest

  • n_estimators: Number of trees
  • max_depth: Max tree depth (null = unlimited)
  • min_samples_split: Min samples to split
  • max_features: Features per split (sqrt, log2, or null)

SVM (Support Vector Machine)

  • C: Regularisation parameter
  • kernel: rbf, linear, poly, or sigmoid
  • gamma: Kernel coefficient (scale or auto)

XGBoost

  • n_estimators: Boosting rounds
  • max_depth: Max tree depth
  • learning_rate: Step size (eta)
  • subsample: Training instance ratio
  • colsample_bytree: Feature ratio

Logistic Regression (classification only)

  • C: Inverse regularisation strength
  • max_iter: Max solver iterations
  • solver: lbfgs, liblinear, newton-cg, etc.
  • penalty: l1, l2, elasticnet, or null

Linear Regression (regression only)

  • fit_intercept: Whether to calculate the intercept (default: true)

Notes on metrics

Classification:

  • Primary ranking metric: Cohen's kappa (accounts for chance agreement, robust for imbalanced data)
  • Also reported: accuracy, balanced accuracy, precision, recall, F1, ROC-AUC, PR-AUC

Regression:

  • Primary ranking metric: MSE (Mean Squared Error, lower is better)
  • Also reported: RMSE, MAE, R², MAPE

Additional notes

  • For classification with two classes, ROC-AUC and PR-AUC are computed if the model can produce probabilities (e.g., MLP, RandomForest, SVM with probability=True).
  • For multi-class problems, macro-averaged Precision/Recall/F1 summarise performance across all classes.
  • Models are ranked by Cohen's kappa (classification) or MSE (regression) to identify the best performer.
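
These metrics map directly onto scikit-learn functions, so you can reproduce them on your own predictions. A small illustration:

from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score, f1_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

print(cohen_kappa_score(y_true, y_pred))          # chance-corrected agreement
print(balanced_accuracy_score(y_true, y_pred))    # mean per-class recall
print(f1_score(y_true, y_pred, average='macro'))  # macro-averaged F1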

Hyperparameter Tuning

EcoNetToolkit includes automated hyperparameter tuning with proper train/validation/test splits to prevent data leakage. This is especially important for ecological data with spatial or temporal structure.

Quick Example:

python run.py --config configs/mangrove_tuning.yaml

Key Features:

  • Grouped splits: Assign groups (e.g., patches, sites, years) to train/val/test sets
  • Automatic search: GridSearchCV or RandomizedSearchCV to find optimal hyperparameters
  • Multiple seeds: Run with different random seeds for stable results
  • Proper evaluation: Tune on train+val, evaluate on held-out test set
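
The grouped-split idea is the same one behind scikit-learn's group-aware splitters: every row from a group stays on one side of the split. A minimal sketch using GroupShuffleSplit (an illustration of the principle, not necessarily the toolkit's internals):

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.random((100, 3))
groups = np.repeat(np.arange(8), 13)[:100]  # e.g. samples from 8 spatial patches

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, groups=groups))

# Every patch lands entirely in train or in test, so spatially
# correlated rows never leak across the split.
assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))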

Example Config:

problem_type: regression

data:
  path: data/mangrove.csv
  cv_group_column: patch_id    # Group by spatial patches
  n_train_groups: 4            # 4 patches for training
  n_val_groups: 2              # 2 patches for validation (tuning)
  n_test_groups: 2             # 2 patches for test (final eval)
  
  labels: [NDVI]
  features: [pu_x, pu_y, temperature, ...]
  scaling: standard

# Enable hyperparameter tuning
tuning:
  enabled: true
  search_method: random       # "random" or "grid"
  n_iter: 30                  # Number of parameter combinations
  cv_folds: 3                 # CV folds during tuning
  scoring: neg_mean_squared_error
  n_jobs: -1                  # Use all CPU cores

# Define models and search spaces
models:
  - name: random_forest
    param_space:
      n_estimators: [100, 200, 500, 1000]
      max_depth: [10, 20, 30, 50]
      min_samples_split: [2, 5, 10]
      max_features: [sqrt, log2, 0.5]

training:
  repetitions: 5
  random_seed: 42
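
The tuning settings above map onto scikit-learn's RandomizedSearchCV. A rough stand-alone equivalent of what this config requests (illustrative; the toolkit layers grouped splits and repeated seeds on top of this):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=6, random_state=42)

param_space = {
    'n_estimators': [100, 200, 500, 1000],
    'max_depth': [10, 20, 30, 50],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2', 0.5],
}

# Sample 30 parameter combinations and score each with 3-fold CV on all cores.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_space,
    n_iter=30,
    cv=3,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)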

Outputs include:

  • Best hyperparameters for each seed
  • Validation and test set performance
  • Comparison plots
  • Trained models with optimal parameters

For detailed information, see docs/HYPERPARAMETER_TUNING.md

Using your own data

  1. Place your CSV file in the data folder.

  2. Make a YAML config file in the configs folder for your data.

    Use one of the existing config files (penguins for classification; possum for regression) as a basis. Change the CSV path to point to your file, and update the features and label parameters to match the columns in your CSV. The parameters for each model should then be tuned to your problem.

    If you are unsure how to write the YAML file, try providing ChatGPT (or your favourite LLM) with your CSV file (or its first few rows) and a link to this repository, and ask it to generate a config file for your data. Consider data privacy before doing this.

Some tips:

  • Ensure your features: list includes only columns available in your CSV.
  • Text categories are automatically one-hot encoded.
  • If your dataset is very imbalanced, consider class_weight: balanced in model.params for logistic regression or SVM, or tune scale_pos_weight for xgboost (a quick way to compute it is sketched below).
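
For the xgboost tip, scale_pos_weight is conventionally set to the ratio of negative to positive samples. A quick way to compute it (hypothetical file and column names):

import pandas as pd

df = pd.read_csv('data/my_data.csv')       # hypothetical dataset
counts = df['label'].value_counts()        # hypothetical binary label (0/1)
scale_pos_weight = counts[0] / counts[1]   # negatives / positives
print(f'scale_pos_weight: {scale_pos_weight:.2f}')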

Testing

Testing is provided for development purposes and is used by the CI system when pull requests are created.

Unit Tests

Run the test suite to ensure everything works correctly:

python run_tests.py all -v

Or run with coverage:

python run_tests.py all -v --cov=ecosci --cov-report=html

The tests verify:

  • Data loading and preprocessing (scaling, encoding, splits)
  • Model instantiation and training for all model types
  • Metric computation produces sane values (0-1 ranges, no NaN/inf)
  • Full end-to-end pipeline runs without errors
  • Models produce reasonable accuracy (better than random)

End-to-End Testing

Test the full pipeline with the included example datasets:

Classification (Penguins):

python run.py --config configs/penguins_config.yaml

Regression (Possum):

python run.py --config configs/possum_config.yaml

These demonstrate that the toolkit works correctly for both problem types and generates appropriate metrics and visualisations.

Troubleshooting

  • Shapes or column errors: double-check your features: and label: names.
  • No probabilities for some models: not all models support predict_proba; plots that need probabilities are skipped automatically.

Development layout

  • run.py — simple entrypoint script
  • ecosci/ — package with modules:
    • config.py (YAML reader)
    • data.py (CSV loader + preprocessing)
    • models.py (ModelZoo)
    • trainer.py (seeded training loop, saving models)
    • eval.py (metrics and plots)
  • configs/ — example configurations
  • data/ — sample CSVs for quick testing

About

A toolkit for ecology-focused artificial neural networks, to compare against classical methods.
