EcoNetToolkit lets you train a shallow neural network or classical models on your tabular ecological data using a simple YAML file.
- CSV input with automatic preprocessing (impute, scale, encode)
- Model zoo: MLP (shallow), Random Forest, SVM, XGBoost, Logistic Regression, Linear Regression
- Hyperparameter tuning with grouped train/val/test splits (prevents data leakage)
- Repeated training with different seeds for stable estimates
- Metrics, including ones suited to imbalanced datasets (balanced accuracy, PR-AUC)
- K-fold cross-validation with spatial/temporal grouping
- Configure the project from a single config file
- EcoNetToolkit — simple models for ecological data
For these steps (macOS and Linux), open a new terminal and enter the commands in the command line.
- Clone the repository and move into the directory:
    git clone https://github.com/ysims/EcoNetToolkit.git
    cd EcoNetToolkit
- In your terminal, run:
    python3 -m venv .venv
    source .venv/bin/activate
    python -m pip install --upgrade pip
    pip install -r requirements.txt
  To leave the venv later, run deactivate.
If you have already followed these steps during a previous session, reactivate the virtual environment by opening a terminal in the EcoNetToolkit directory and running:
    source .venv/bin/activate
Install Anaconda (Windows 64‑bit) from the official website, using the default settings. After installation, open 'Anaconda Prompt' from the Start Menu. In the prompt, run the following steps.
- Get Git (this may already be installed):
    conda install git
- Clone the repository and move into the directory:
    git clone https://github.com/ysims/EcoNetToolkit.git
    cd EcoNetToolkit
- Create the conda environment and activate it:
    conda env create -f environment.yml
    conda activate econet
If the conda command isn’t recognised, make sure you’re in the Anaconda Prompt.
If you have already followed these steps during a previous session, reactivate the conda environment by opening an Anaconda Prompt in the EcoNetToolkit directory and running:
    conda activate econet
All commands should be run in the terminal (macOS and Linux) or the Anaconda Prompt (Windows).
EcoNetToolkit includes two example datasets to help you get started.
Classification Example: Palmer Penguins
Predict penguin species from morphological measurements:
    python run.py --config configs/penguins_config.yaml
This demonstrates multi-class classification (3 species: Adelie, Chinstrap, Gentoo) using features like bill length, flipper length, and body mass.
Regression Example: Possum Morphology
Predict possum age from morphological measurements:
    python run.py --config configs/possum_config.yaml
This demonstrates continuous variable prediction using head length, skull width, and other physical measurements.
Outputs are organised into folders based on your config file name. For example, running configs/possum_config.yaml creates:
outputs/
└── possum_config/                          # Named after your config file
    ├── random_forest/                      # Model-specific subfolder
    │   ├── model_random_forest_seed42.joblib
    │   ├── model_random_forest_seed43.joblib
    │   ├── ...
    │   ├── report_random_forest.json
    │   ├── confusion_matrix_random_forest.png   (classification only)
    │   ├── pr_curve_random_forest.png           (classification only)
    │   └── residual_plot_random_forest.png      (regression only)
    ├── xgboost/
    ├── mlp/
    ├── svm/
    ├── linear/
    ├── report_all_models.json              # Combined results across all models
    ├── comparison_mse.png                  # Comparison plots (regression)
    ├── comparison_r2.png
    ├── comparison_accuracy.png             # Comparison plots (classification)
    ├── comparison_f1.png
    └── pr_curve_comparison.png             # Combined PR curves (classification)
Model-specific outputs (in each model subfolder):
- model_<name>_seed<N>.joblib: trained models for each random seed
- report_<model>.json: per-seed metrics (MSE, R², accuracy, F1, etc.)
- confusion_matrix_<model>.png: confusion matrix heatmap (classification)
- pr_curve_<model>.png: precision-recall curve (classification)
- residual_plot_<model>.png: predicted vs actual and residuals (regression)
Multi-model comparison outputs (in the config folder root):
- report_all_models.json: combined metrics across all models and seeds
- comparison_*.png: side-by-side boxplots comparing model performance
- pr_curve_comparison.png: overlaid precision-recall curves (classification)
To inspect saved models, run:
    python inspect_models.py outputs/possum_config/random_forest/model_*.joblib
This shows model type, parameters, training iterations, and other metadata. The .joblib files contain serialised scikit-learn models that you can load and use for predictions:
    import joblib
    model = joblib.load('outputs/possum_config/random_forest/model_random_forest_seed42.joblib')
    predictions = model.predict(X_new)  # X_new must be preprocessed the same way as the training data
You can train a single model or several models for comparison. See configs/penguins_config.yaml for comprehensive examples of all model types and their parameters.
problem_type: classification
data:
  path: data/sample.csv
  features: [f1, f2, habitat]
  label: label
  test_size: 0.2
  val_size: 0.2
  random_state: 0
  scaling: standard
  impute_strategy: mean
models:
  - name: mlp
    params:
      hidden_layer_sizes: [32, 16]
      max_iter: 300
      early_stopping: true
training:
  repetitions: 5
  random_seed: 0
# Optional: specify output directory (defaults to outputs/<config_name>/)
output:
  dir: outputs/my_experiment
Note: If output.dir is not specified, outputs are automatically saved to outputs/<config_name>/ where <config_name> is derived from your config file name.
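As a rough illustration of that default, the output folder and the per-seed model file names could be derived as in the sketch below. This is not the toolkit's own code: the helper names `default_output_dir` and `model_path` are hypothetical, and only the outputs/<config_name>/ layout and the model_<name>_seed<N>.joblib pattern are taken from this README.

```python
from pathlib import Path
from typing import Optional


def default_output_dir(config_path: str, output_dir: Optional[str] = None) -> Path:
    """Return output.dir if given, else outputs/<config_name>/ (hypothetical helper)."""
    if output_dir:
        return Path(output_dir)
    # Path.stem drops the directory part and the .yaml extension.
    return Path("outputs") / Path(config_path).stem


def model_path(out_dir: Path, model_name: str, seed: int) -> Path:
    """Build <out_dir>/<model>/model_<model>_seed<N>.joblib (hypothetical helper)."""
    return out_dir / model_name / f"model_{model_name}_seed{seed}.joblib"


out_dir = default_output_dir("configs/possum_config.yaml")
print(model_path(out_dir, "random_forest", 42).as_posix())
```

Passing an explicit second argument, for example `default_output_dir("configs/possum_config.yaml", "outputs/my_experiment")`, mirrors setting output.dir in the config.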
EcoNetToolkit supports predicting multiple target variables simultaneously (multi-output learning). This is useful when you want to predict several related outcomes from the same features.
Example: Multi-output classification
problem_type: classification
data:
  path: data/palmerpenguins_extended.csv
  features: [bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, island]
  labels: [species, sex, life_stage]  # Predict 3 labels simultaneously
  test_size: 0.2
  val_size: 0.15
  scaling: standard
models:
  - name: random_forest
    params:
      n_estimators: 100
      max_depth: 15
training:
  repetitions: 5
Example: Multi-output regression
problem_type: regression
data:
  path: data/possum.csv
  features: [hdlngth, skullw, totlngth, taill]
  labels: [age, chest, belly]  # Predict 3 continuous values
  test_size: 0.2
  scaling: standard
models:
  - name: mlp
    params:
      hidden_layer_sizes: [32, 16]
      max_iter: 500
Key points:
- Use labels: (list) instead of label: (single string) to specify multiple targets
- For backward compatibility, label: still works for single-output prediction
- Multi-output metrics report mean and standard deviation across all outputs
- Some models support multi-output natively (Random Forest, MLP Regressor), others are wrapped automatically (Logistic Regression, SVM, Linear Regression)
See configs/penguins_multilabel.yaml and configs/possum_multilabel.yaml for complete examples.
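To make the "mean and standard deviation across all outputs" point concrete, here is a minimal sketch that scores each target separately and then summarises the scores. It is an illustration only: the function names are hypothetical, accuracy stands in for whichever metric is reported, and the use of the population standard deviation is an assumption about the convention.

```python
from statistics import mean, pstdev


def accuracy(y_true, y_pred):
    """Fraction of matching labels for a single output column."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def summarise_outputs(y_true_cols, y_pred_cols):
    """Score each output separately, then report mean and std across outputs."""
    scores = {name: accuracy(y_true_cols[name], y_pred_cols[name])
              for name in y_true_cols}
    values = list(scores.values())
    # Assumption: population standard deviation; the toolkit may differ.
    return scores, mean(values), pstdev(values)


# Toy labels for two targets (made-up values, not real penguin data).
y_true = {"species": ["a", "b", "a", "c"], "sex": ["m", "f", "f", "m"]}
y_pred = {"species": ["a", "b", "a", "a"], "sex": ["m", "f", "m", "f"]}
per_output, avg, spread = summarise_outputs(y_true, y_pred)
print(per_output, avg, spread)
```

A large spread relative to the mean suggests that one target is much harder to predict than the others, which the averaged score alone would hide.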
MLP (Multi-Layer Perceptron)
- hidden_layer_sizes: List of layer sizes, e.g., [32, 16]
- max_iter: Maximum iterations
- early_stopping: Stop when the validation score plateaus
- n_iter_no_change: Patience, the number of epochs to wait without improvement (default: 10)
- validation_fraction: Fraction of training data for validation
- alpha: L2 regularisation
- learning_rate_init: Initial learning rate
Random Forest
- n_estimators: Number of trees
- max_depth: Max tree depth (null = unlimited)
- min_samples_split: Min samples to split
- max_features: Features per split (sqrt, log2, or null)
SVM (Support Vector Machine)
- C: Regularisation parameter
- kernel: rbf, linear, poly, or sigmoid
- gamma: Kernel coefficient (scale or auto)
XGBoost
- n_estimators: Boosting rounds
- max_depth: Max tree depth
- learning_rate: Step size (eta)
- subsample: Training instance ratio
- colsample_bytree: Feature ratio
Logistic Regression (classification only)
- C: Inverse regularisation strength
- max_iter: Max solver iterations
- solver: lbfgs, liblinear, newton-cg, etc.
- penalty: l1, l2, elasticnet, or null
Linear Regression (regression only)
- fit_intercept: Whether to calculate the intercept (default: true)
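A small check like the following can catch misspelt parameter names before a long training run. It is a hypothetical helper, not part of the toolkit: the dictionary only lists the parameters documented above, and the model keys are assumed to match the names used in the configs. The underlying estimators accept more parameters, so a flagged name is a prompt to double-check, not necessarily an error.

```python
# Parameters documented in this README, keyed by the model names used in configs.
DOCUMENTED_PARAMS = {
    "mlp": {"hidden_layer_sizes", "max_iter", "early_stopping", "n_iter_no_change",
            "validation_fraction", "alpha", "learning_rate_init"},
    "random_forest": {"n_estimators", "max_depth", "min_samples_split", "max_features"},
    "svm": {"C", "kernel", "gamma"},
    "xgboost": {"n_estimators", "max_depth", "learning_rate", "subsample",
                "colsample_bytree"},
    "logistic": {"C", "max_iter", "solver", "penalty"},
    "linear": {"fit_intercept"},
}


def undocumented_params(model_name, params):
    """Return parameter names that are not in the lists above (possible typos)."""
    known = DOCUMENTED_PARAMS.get(model_name, set())
    return sorted(set(params) - known)


# "max_iters" is a deliberate typo for "max_iter".
print(undocumented_params("mlp", {"hidden_layer_sizes": [32, 16], "max_iters": 300}))
```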
Classification:
- Primary ranking metric: Cohen's kappa (accounts for chance agreement, robust for imbalanced data)
- Also reported: accuracy, balanced accuracy, precision, recall, F1, ROC-AUC, PR-AUC
Regression:
- Primary ranking metric: MSE (Mean Squared Error, lower is better)
- Also reported: RMSE, MAE, R², MAPE
- For classification with two classes, ROC-AUC and PR-AUC are computed if the model can produce probabilities (e.g., MLP, RandomForest, SVM with probability=True).
- For multi-class problems, macro-averaged Precision/Recall/F1 summarise performance across all classes.
- Models are ranked by Cohen's kappa (classification) or MSE (regression) to identify the best performer.
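To show why Cohen's kappa is a reasonable ranking metric for imbalanced data, the sketch below implements the standard formula, kappa = (p_o - p_e) / (1 - p_e), in plain Python. It is an illustration of the metric, not the toolkit's implementation, and the toy labels are made up.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # Observed agreement: fraction of samples where prediction matches truth.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Agreement expected by chance, from the marginal label frequencies.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when only one label occurs; this sketch returns 0.0.
        return 0.0
    return (p_o - p_e) / (1 - p_e)


# A model that always predicts the majority class on an 80/20 dataset.
y_true = ["absent"] * 8 + ["present"] * 2
y_pred = ["absent"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, cohens_kappa(y_true, y_pred))
```

The majority-class predictor reaches 0.8 accuracy but a kappa of 0, because its agreement with the truth is no better than chance.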
EcoNetToolkit includes automated hyperparameter tuning with proper train/validation/test splits to prevent data leakage. This is especially important for ecological data with spatial or temporal structure.
Quick Example:
    python run.py --config configs/mangrove_tuning.yaml
Key Features:
- Grouped splits: Assign groups (e.g., patches, sites, years) to train/val/test sets
- Automatic search: GridSearchCV or RandomizedSearchCV to find optimal hyperparameters
- Multiple seeds: Run with different random seeds for stable results
- Proper evaluation: Tune on train+val, evaluate on held-out test set
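The idea behind grouped splits can be sketched as follows: whole groups, not individual rows, are assigned to the train, validation, and test sets, so rows from the same patch never end up on both sides of a split. The function names and arguments here are hypothetical and the shuffling strategy is an assumption; the toolkit's own splitting logic may differ.

```python
import random


def split_groups(group_ids, n_train, n_val, n_test, seed=0):
    """Assign whole groups (e.g. patches) to train/val/test (hypothetical helper)."""
    groups = sorted(set(group_ids))
    if n_train + n_val + n_test > len(groups):
        raise ValueError("not enough groups for the requested split")
    # Shuffle group labels reproducibly, then slice into three disjoint sets.
    random.Random(seed).shuffle(groups)
    train = set(groups[:n_train])
    val = set(groups[n_train:n_train + n_val])
    test = set(groups[n_train + n_val:n_train + n_val + n_test])
    return train, val, test


def rows_for(group_ids, chosen):
    """Row indices whose group belongs to the chosen set."""
    return [i for i, g in enumerate(group_ids) if g in chosen]


# Toy data: 8 patches with 3 rows each.
patch_id = [p for p in range(8) for _ in range(3)]
train, val, test = split_groups(patch_id, n_train=4, n_val=2, n_test=2, seed=42)
print(len(rows_for(patch_id, train)),
      len(rows_for(patch_id, val)),
      len(rows_for(patch_id, test)))
```

Because the three group sets are disjoint, spatially or temporally correlated rows cannot leak from the training data into validation or test.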
Example Config:
problem_type: regression
data:
  path: data/mangrove.csv
  cv_group_column: patch_id  # Group by spatial patches
  n_train_groups: 4          # 4 patches for training
  n_val_groups: 2            # 2 patches for validation (tuning)
  n_test_groups: 2           # 2 patches for test (final eval)
  labels: [NDVI]
  features: [pu_x, pu_y, temperature, ...]
  scaling: standard
# Enable hyperparameter tuning
tuning:
  enabled: true
  search_method: random  # "random" or "grid"
  n_iter: 30             # Number of parameter combinations
  cv_folds: 3            # CV folds during tuning
  scoring: neg_mean_squared_error
  n_jobs: -1             # Use all CPU cores
# Define models and search spaces
models:
  - name: random_forest
    param_space:
      n_estimators: [100, 200, 500, 1000]
      max_depth: [10, 20, 30, 50]
      min_samples_split: [2, 5, 10]
      max_features: [sqrt, log2, 0.5]
training:
  repetitions: 5
  random_seed: 42
Outputs include:
- Best hyperparameters for each seed
- Validation and test set performance
- Comparison plots
- Trained models with optimal parameters
For detailed information, see docs/HYPERPARAMETER_TUNING.md
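To give a feel for the cost difference between the two search methods, the sketch below enumerates the full grid for the random_forest search space from the example config and draws a random subset of it. It only illustrates the idea in plain Python; the toolkit is described as using GridSearchCV or RandomizedSearchCV, and the function names here are hypothetical.

```python
import itertools
import random


def full_grid(param_space):
    """Every combination of the candidate values (what a grid search tries)."""
    names = list(param_space)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(param_space[n] for n in names))]


def random_candidates(param_space, n_iter, seed=0):
    """Draw up to n_iter distinct combinations at random from the grid."""
    combos = full_grid(param_space)
    return random.Random(seed).sample(combos, min(n_iter, len(combos)))


param_space = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [10, 20, 30, 50],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2", 0.5],
}
print(len(full_grid(param_space)))                      # 4 * 4 * 3 * 3 = 144
print(len(random_candidates(param_space, n_iter=30)))   # 30
```

With 3 CV folds, and ignoring any final refit, that is 30 × 3 = 90 model fits per search for the random method, compared with 144 × 3 = 432 for the full grid.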
- Place your CSV file in the data folder.
- Make a yaml config file in the configs folder for your data.
  Use one of the existing config files (penguin for classification; possum for regression) as a basis for your data. Change the CSV path to point to your CSV file and change the features and label parameters to match the columns in your CSV file. The parameters for the different models should be tuned for your problem.
  If you are unsure how to make the yaml file, try providing ChatGPT (or your favourite LLM) with your CSV file (or the first few rows) and a link to this repository, and ask it to make a config file for your data. Consider data privacy before doing this.
Some tips:
- Ensure your features: list includes only columns available in your CSV.
- Text categories are automatically one-hot encoded.
- If your dataset is very imbalanced, consider class_weight: balanced in model.params for logistic or svm, or tune scale_pos_weight for xgboost.
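For the last tip, it can help to see what the weights look like on your own labels. The sketch below reproduces the heuristic scikit-learn documents for class_weight="balanced", n_samples / (n_classes * class_count), and the negatives-to-positives ratio commonly suggested as a starting value for scale_pos_weight in binary XGBoost problems. The function names and toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights as n_samples / (n_classes * class_count), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def suggested_scale_pos_weight(labels, positive):
    """Common starting point for binary problems: negatives / positives."""
    n_pos = sum(1 for y in labels if y == positive)
    return (len(labels) - n_pos) / n_pos


# Toy labels: 90 absences and 10 presences.
labels = ["absent"] * 90 + ["present"] * 10
print(balanced_class_weights(labels))
print(suggested_scale_pos_weight(labels, positive="present"))
```

Here the rare class receives a weight of 5.0 against roughly 0.56 for the common class, so each rare sample counts about nine times as much during training.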
Testing is provided for development purposes and is used by the CI system when pull requests are created.
Run the test suite to ensure everything works correctly:
    python run_tests.py all -v
Or run with coverage:
    python run_tests.py all -v --cov=ecosci --cov-report=html
The tests verify:
- Data loading and preprocessing (scaling, encoding, splits)
- Model instantiation and training for all model types
- Metric computation produces sane values (0-1 ranges, no NaN/inf)
- Full end-to-end pipeline runs without errors
- Models produce reasonable accuracy (better than random)
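A sanity check of the kind described in the third item might look like the sketch below. It is not taken from the test suite: the function name and the metric keys are hypothetical, and which metrics are expected to lie in the 0-1 range is an assumption passed in as an argument.

```python
import math


def check_metrics(metrics, unit_interval=("accuracy", "balanced_accuracy", "f1")):
    """Return a list of problems: non-finite values or 0-1 metrics out of range."""
    problems = []
    for name, value in metrics.items():
        if not math.isfinite(value):
            # Catches NaN as well as positive or negative infinity.
            problems.append(f"{name} is not finite: {value}")
        elif name in unit_interval and not 0.0 <= value <= 1.0:
            problems.append(f"{name} is outside [0, 1]: {value}")
    return problems


print(check_metrics({"accuracy": 0.91, "f1": 0.88, "mse": 12.3}))
print(check_metrics({"accuracy": 1.2, "f1": float("nan")}))
```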
Test the full pipeline with the included example datasets:
Classification (Penguins):
    python run.py --config configs/penguins_config.yaml
Regression (Possum):
    python run.py --config configs/possum_config.yaml
These demonstrate that the toolkit works correctly for both problem types and generates appropriate metrics and visualisations.
- Shapes or column errors: double-check your features: and label: names.
- No probabilities for some models: not all models support predict_proba; plots that need probabilities are skipped automatically.
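For the first item, a quick way to find the offending names is to compare the configured columns against the CSV header before running the pipeline. The helper below is a hypothetical sketch using only the standard library; it is not part of the toolkit, and the column names in the example are illustrative.

```python
import csv
import io


def missing_columns(csv_file, features, labels):
    """Return configured column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file))
    available = {name.strip() for name in header}
    return [col for col in list(features) + list(labels) if col not in available]


# In-memory stand-in for a CSV file; with a real file use open(path, newline="").
sample = io.StringIO("bill_length_mm,bill_depth_mm,species\n39.1,18.7,Adelie\n")
print(missing_columns(sample, ["bill_length_mm", "flipper_length_mm"], ["species"]))
```

An empty list means every configured feature and label was found in the header.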
- run.py: simple entrypoint script
- ecosci/: package with modules:
  - config.py (YAML reader)
  - data.py (CSV loader + preprocessing)
  - models.py (ModelZoo)
  - trainer.py (seeded training loop, saving models)
  - eval.py (metrics and plots)
- configs/: example configuration
- data/: sample CSV for quick testing