This project analyzes Android malware samples using VirusTotal data. It ingests JSON reports from VirusTotal's API into MongoDB and examines them from several angles: detection patterns across antivirus engines, geographic and temporal submission trends, and relationships between samples. Results are presented through an interactive web interface with visualizations and access to the raw data.
All analysis results are available through an interactive website (index.html) that provides:
- Interactive geographic visualizations of malware distribution
- Comprehensive statistical analysis results
- Downloadable raw data in CSV format
- Network relationship visualizations
- Temporal analysis charts
├── index.html # Interactive web interface for viewing results
├── VT_Initial_Insert.py # Initial data ingestion into MongoDB
├── VT_Analysis.py # Main analysis and visualization scripts
├── VT_modelling.py # Machine learning models (PCA and prediction)
├── requirements.txt # Project dependencies
└── results/ # Generated analysis outputs and visualizations
- MongoDB integration for efficient data storage and querying
- Processes VirusTotal JSON reports for Android malware samples
- Handles complex nested data structures
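As an illustration of handling nested report data, here is a minimal sketch that flattens a nested dictionary into dotted keys, which is one common way to prepare JSON for tabular analysis. The function name `flatten_report` and the sample fields are hypothetical and are not taken from the project's scripts.

```python
from typing import Any, Dict


def flatten_report(obj: Any, parent_key: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts into a single-level dict with dotted keys.

    Lists are kept as values rather than expanded, which keeps the
    number of output columns predictable for tabular tools.
    """
    flat: Dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
            if isinstance(value, dict) and value:
                # Recurse into non-empty nested dicts.
                flat.update(flatten_report(value, full_key, sep))
            else:
                flat[full_key] = value
    else:
        flat[parent_key] = obj
    return flat


# Hypothetical, heavily trimmed report used only for illustration.
sample_report = {
    "sha256": "abc123",
    "submission": {"country": "DE", "date": 1700000000},
    "stats": {"malicious": 12, "undetected": 48},
    "tags": ["apk", "android"],
}

flat = flatten_report(sample_report)
print(flat)
```

The flattened dictionary can be passed to a DataFrame constructor or written out as one CSV row per sample.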
- Geographic distribution of malware samples
- Detection rates across antivirus engines
- Temporal analysis of sample submissions
- Tag-based categorization and analysis
- Interactive geographic visualizations using Folium
- Detection rate distribution plots
- Network graphs of malware relationships
- Time series analysis visualizations
- Distribution of samples by country
- Interactive world maps with detection statistics
- Country-specific malware submission patterns
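A per-country summary of this kind could be computed roughly as follows. This is a sketch under assumptions: the record layout (country code, number of engines flagging the sample, number of engines that scanned it) and the function name `summarize_by_country` are hypothetical, and the values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical records: (country code, engines flagging, engines total).
records = [
    ("US", 30, 60),
    ("US", 15, 60),
    ("DE", 45, 60),
    ("IN", 6, 60),
]


def summarize_by_country(rows):
    """Return {country: {"samples": n, "mean_detection_rate": r}}."""
    totals = defaultdict(lambda: {"samples": 0, "rate_sum": 0.0})
    for country, positives, total in rows:
        if total == 0:
            continue  # skip reports with no engine results
        bucket = totals[country]
        bucket["samples"] += 1
        bucket["rate_sum"] += positives / total
    return {
        country: {
            "samples": b["samples"],
            "mean_detection_rate": b["rate_sum"] / b["samples"],
        }
        for country, b in totals.items()
    }


summary = summarize_by_country(records)
for country, stats in sorted(summary.items()):
    print(country, stats["samples"], round(stats["mean_detection_rate"], 3))
```

A dictionary keyed by country code is convenient input for a choropleth layer or for popups on a map.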
- Engine-specific detection rates
- Comparative analysis of engine effectiveness
- Detection pattern analysis using PCA
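Engine-specific detection rates can be sketched as the share of scanned samples that each engine flags. The report layout below (a `results` mapping from engine name to verdict) and the engine names are assumptions made for illustration; the project's stored reports may be shaped differently.

```python
from collections import Counter

# Hypothetical reports: engine name -> verdict. Field names are assumptions.
reports = [
    {"results": {"EngineA": "malicious", "EngineB": "undetected", "EngineC": "malicious"}},
    {"results": {"EngineA": "malicious", "EngineB": "malicious"}},
    {"results": {"EngineA": "undetected", "EngineB": "undetected", "EngineC": "malicious"}},
]


def engine_detection_rates(reports):
    """Return {engine: fraction of scanned samples flagged as malicious}."""
    scanned = Counter()
    detected = Counter()
    for report in reports:
        for engine, verdict in report.get("results", {}).items():
            scanned[engine] += 1
            if verdict == "malicious":
                detected[engine] += 1
    # Engines only appear in `scanned` after at least one scan,
    # so the division below cannot hit zero.
    return {engine: detected[engine] / scanned[engine] for engine in scanned}


rates = engine_detection_rates(reports)
for engine, rate in sorted(rates.items()):
    print(f"{engine}: {rate:.2f}")
```

Dividing by the number of samples each engine actually scanned, rather than by the total sample count, avoids penalizing engines that did not return a result for some samples.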
- Analysis of shared children between samples
- Network visualization of malware relationships
- Classification of file types in malware packages
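One way to derive a relationship network from shared children is to connect two samples whenever their sets of child files overlap, weighting the edge by the size of the overlap. The sketch below assumes a hypothetical mapping from parent sample to child hashes; the names and data are illustrative only.

```python
from itertools import combinations

# Hypothetical mapping of parent sample -> set of child file hashes.
children = {
    "sample_a": {"c1", "c2", "c3"},
    "sample_b": {"c2", "c3"},
    "sample_c": {"c4"},
    "sample_d": {"c3", "c4"},
}


def shared_children_edges(children, min_shared=1):
    """Return weighted edges (parent_1, parent_2, shared_count)."""
    edges = []
    # Sorting by parent name makes the output order deterministic.
    for (p1, kids1), (p2, kids2) in combinations(sorted(children.items()), 2):
        shared = kids1 & kids2
        if len(shared) >= min_shared:
            edges.append((p1, p2, len(shared)))
    return edges


edges = shared_children_edges(children)
for edge in edges:
    print(edge)
```

The resulting tuples can be loaded into a networkx graph with `Graph.add_weighted_edges_from(edges)` for layout and drawing. Pairwise comparison is quadratic in the number of samples, so a large dataset may need an inverted index from child hash to parents instead.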
- Time-based submission patterns
- Hourly and daily submission trends
- Evolution of detection rates over time
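Hourly and daily submission trends can be sketched by bucketing submission timestamps. The example assumes timestamps are Unix epoch seconds interpreted as UTC, which may not match how the project stores them; the values and the function name are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical submission times as Unix epoch seconds (UTC).
submission_timestamps = [1700000000, 1700003600, 1700086400, 1700090000]


def submissions_per_hour_and_day(timestamps):
    """Count submissions by hour of day (0-23) and by calendar date."""
    by_hour = Counter()
    by_day = Counter()
    for ts in timestamps:
        moment = datetime.fromtimestamp(ts, tz=timezone.utc)
        by_hour[moment.hour] += 1
        by_day[moment.date().isoformat()] += 1
    return by_hour, by_day


by_hour, by_day = submissions_per_hour_and_day(submission_timestamps)
print(dict(by_hour))
print(dict(by_day))
```

Using an explicit timezone keeps the hourly buckets stable regardless of the machine the analysis runs on.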
- Python 3.x
- MongoDB
- Key Python libraries:
  - pandas
  - numpy
  - scikit-learn
  - matplotlib
  - seaborn
  - folium
  - networkx
  - pymongo
The project stores its data in MongoDB, organized as follows:
- Main collection storing VirusTotal reports
- Indexed fields for efficient querying
- Structured schema for nested data
- Children Prediction Model
- Random Forest Regressor
- Feature engineering from sample metadata
- Performance metrics and validation
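The modelling steps above can be sketched with scikit-learn's `RandomForestRegressor`. Everything about the data here is an assumption: the three features, their names, and the synthetic target are stand-ins chosen for illustration, not the features or results used in `VT_modelling.py`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)

# Synthetic stand-in features (hypothetical): file size in KB,
# number of tags, number of engines flagging the sample.
n_samples = 200
X = np.column_stack([
    rng.uniform(100, 50_000, n_samples),
    rng.integers(0, 10, n_samples),
    rng.integers(0, 60, n_samples),
])
# Synthetic target: child count loosely tied to file size, plus noise.
y = np.round(X[:, 0] / 1_000 + rng.normal(0, 2, n_samples)).clip(min=0)

# Hold out a quarter of the samples for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
feature_names = ["size_kb", "tag_count", "positives"]
importances = dict(zip(feature_names, model.feature_importances_))

print(f"MAE on held-out samples: {mae:.2f}")
print(importances)
```

Evaluating on a held-out split rather than on the training data gives a more honest error estimate, and the feature importances indicate which metadata fields the forest relied on.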
The results/ directory contains various analysis outputs:
- CSV files with detailed analysis results
- PNG/HTML visualization files
- Statistical summaries
- Model performance reports
- Clone the repository.
- To view the analysis results:
  - Open index.html in a web browser to explore the interactive visualizations and download data.
  - All visualizations and data files are available in the results/ directory.
- To run the analysis pipeline, install dependencies:

      pip install -r requirements.txt

- Configure the MongoDB connection in the .env file:

      MONGO_URI=your_mongodb_connection_string
      DATABASE_NAME=your_database_name
      COLLECTION_NAME=your_collection_name

- Run the analysis pipeline:

      python VT_Initial_Insert.py   # Initial data loading
      python VT_Analysis.py         # Run main analysis
      python VT_modelling.py        # Run ML models
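For reference, the three settings in the .env file could be read as in the sketch below. How the project's scripts actually load them is not documented here, so this is a dependency-free illustration; the helper names `load_env_file` and `get_mongo_settings` are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines; ignore blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, since connection strings may contain more.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_mongo_settings(path=".env"):
    """Return the three settings, preferring real environment variables."""
    file_values = load_env_file(path)
    names = ("MONGO_URI", "DATABASE_NAME", "COLLECTION_NAME")
    settings = {name: os.environ.get(name, file_values.get(name)) for name in names}
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise RuntimeError(f"Missing settings: {', '.join(missing)}")
    return settings
```

Failing early with a clear message when a setting is missing is usually easier to debug than a connection error raised later in the pipeline.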
- Detailed CSV reports
- Interactive maps
- Statistical visualizations
- Network graphs
- Model performance metrics
- Detection rates
- Geographic distributions
- Temporal patterns
- Malware relationships
- Feature importance
- Model accuracy metrics
- Real-time analysis capabilities
- Advanced clustering algorithms
- Deep learning model integration
- API automation
- Enhanced visualization options
This project is licensed under the MIT License - see the LICENSE file for details.