This guide will help you set up your development environment and get started with the projects in this bootcamp.
Make sure you have the following tools installed before you begin:
| Tool | Version | Notes |
|---|---|---|
| Python | >= 3.10 | Use pyenv to manage versions (recommended) |
| Node.js | >= 20.9.0 | Use nvm to manage versions (recommended) |
| npm | Bundled with Node.js | — |
| uv | Latest | Python package manager |
| Git | Latest | Version control |
| VS Code | Latest | Recommended editor |
| Postman | Latest | API testing (optional but recommended) |
| Docker | Latest | Containerization (optional but recommended) |
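Before moving on, you can sanity-check the core prerequisites with a small script. This is an illustrative helper, not part of the bootcamp repo; the tool names mirror the table above.

```python
# check_env.py -- quick sanity check for the core prerequisites.
import shutil
import sys


def check_python(min_version=(3, 10)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= min_version


def find_tools(tools=("git", "node", "npm", "uv")):
    """Map each required CLI tool to its resolved path (None if missing)."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    print("Python OK:", check_python())
    for tool, path in find_tools().items():
        print(f"{tool}: {'found at ' + path if path else 'MISSING'}")
```

Run it with `python check_env.py`; any `MISSING` line points at a tool you still need to install.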
Inside the project root directory (`Bootcamp`), create a virtual environment with uv; it will later be used by the Jupyter notebooks:

```shell
uv venv --python 3.12
```
The scraper collects Samsung phone listings from Amazon — pulling key specs (name, price, brand, OS, RAM, CPU details, ratings count, URL) from both search result pages and individual product pages.
Stack: Scrapy for crawling + Selenium (Chrome WebDriver) for rendering dynamic content.
⚠️ Amazon may block automated traffic, show CAPTCHAs, or return partial data. Always comply with the site's Terms of Service and applicable laws.
```
data_acquisition/
├── amazon_samsung/
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders/
│       ├── __init__.py
│       └── samsung_phones.py   ← spider must live here
├── env/                        # optional virtual environment
├── scrapy.cfg
└── samsung_phones_specs.csv    # example output
```
- Python 3.9+ (3.10+ recommended)
- Google Chrome installed
- Python packages: `scrapy`, `selenium`, `webdriver-manager`, `scrapy-fake-useragent`
```shell
# Navigate to the data acquisition directory
cd data_acquisition

# Create and activate a virtual environment
uv venv --python 3.12
source .venv/bin/activate

# Install dependencies
uv pip install scrapy selenium webdriver-manager scrapy-fake-useragent
```

```shell
# Export to CSV (overwrites existing file)
uv run scrapy crawl samsung_phones -O samsung_phones_specs.csv

# Export to JSON
uv run scrapy crawl samsung_phones -O samsung_phones_specs.json
```

Use `-o` instead of `-O` to append to an existing file rather than overwrite it.
In `amazon_samsung/spiders/samsung_phones.py`:

```python
max_pages = 10  # number of Amazon search pages to crawl
search_url = "https://www.amazon.com/s?k=samsung+phone&page={page}"
```

To run Chrome without opening a browser window, uncomment this line in the spider:

```python
# opts.add_argument("--headless=new")
```

Example scraped item:

```json
{
  "name": "Samsung Galaxy ...",
  "price": "$199.99",
  "brand": "SAMSUNG",
  "operating_system": "Android",
  "ram": "8 GB",
  "cpu_model": "Snapdragon ...",
  "cpu_speed": "3.2 GHz",
  "ratings_count": "1234",
  "url": "https://www.amazon.com/..."
}
```

- Amazon page structure varies by region, account, and session — some selectors may fail intermittently.
- The spider uses a single shared Selenium driver, so concurrency is limited to 1 request at a time.
- Scraping too fast may trigger blocks or CAPTCHAs — be respectful of rate limits.
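One way to respect rate limits is through Scrapy's built-in throttling settings. The setting names below are standard Scrapy configuration keys; the values are illustrative assumptions, not the project's actual `settings.py`:

```python
# Suggested additions to amazon_samsung/settings.py to slow the crawl down.
# Values are illustrative -- tune them to what the site tolerates.
DOWNLOAD_DELAY = 2.0              # base delay (seconds) between requests
RANDOMIZE_DOWNLOAD_DELAY = True   # jitter the delay to look less robotic
AUTOTHROTTLE_ENABLED = True       # adapt delay to observed latency
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
CONCURRENT_REQUESTS = 1           # matches the single shared Selenium driver
RETRY_TIMES = 2                   # give up quickly instead of hammering
```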
The scraped data is raw and must be cleaned before our mobile phone data analyzer system can use it, so we perform data cleaning and normalization to prepare it for the AI operations that follow.
- Navigate to `data_processing` and populate the `.env` file, using `.env.example` as a reference.
- Run the cells in `etl.ipynb` to extract, transform, normalize, and load the raw data into our PostgreSQL database.
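To give a flavor of the transform step, here is a hedged sketch of the kind of normalization applied to a scraped item. The field names match the example item above, but the helper functions are illustrative, not the actual `etl.ipynb` code:

```python
# Illustrative normalization helpers for raw scraped fields.
from __future__ import annotations

import re


def parse_price(raw: str | None) -> float | None:
    """'$199.99' -> 199.99; returns None when no number is present."""
    match = re.search(r"[\d,]+(?:\.\d+)?", raw or "")
    return float(match.group().replace(",", "")) if match else None


def parse_ram_gb(raw: str | None) -> int | None:
    """'8 GB' -> 8."""
    match = re.search(r"(\d+)\s*GB", raw or "", re.IGNORECASE)
    return int(match.group(1)) if match else None


def normalize(item: dict) -> dict:
    """Return a copy of the item with numeric fields parsed and brand cased."""
    return {
        **item,
        "price": parse_price(item.get("price")),
        "ram": parse_ram_gb(item.get("ram")),
        "brand": (item.get("brand") or "").title(),
    }
```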
```shell
# Navigate to the frontend directory
cd data_bot/ui

# Copy the example env file and fill in your values
cp .env.example .env

# Install dependencies
npm install

# Start the development server
npm run dev
```

Open http://localhost:3000 in your browser to view the app.
```shell
# Navigate to the backend directory
cd data_bot/analytics_agent

# Copy the example env file and fill in your values
cp .env.example .env
```

Edit the `.env` file with your credentials:

```
LANGFUSE_SECRET_KEY=your_secret_key
LANGFUSE_PUBLIC_KEY=your_public_key
LANGFUSE_BASE_URL=https://cloud.langfuse.com
GROQ_API_KEY=your_groq_key
DATA_BASE_PATH=
POSTGRES_USER=postgres
POSTGRES_PASSWORD=root
POSTGRES_DB=postgres
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
```

```shell
# Install dependencies
uv sync && uv sync --dev

# Start the server
uv run --env-file .env src/main.py

# If you have make installed locally, you can also run this directly
make run_ui_backend
```

The backend will be available at http://localhost:3050. You can test the API using Postman or any HTTP client.
Docker is used to containerize the application backend; we build the image once and then run that container in production. The commands below handle both steps.
To build the image:

```shell
docker build -t analytics_agent .
```

To run the container:

```shell
docker run --env-file .env -p 3050:3050 --name analytics_agent analytics_agent
```

You can also use Docker Compose to build and run in one step:

```shell
docker compose up --build
```
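For the `docker compose` route, a minimal `docker-compose.yml` could look like the sketch below. This is an assumption based on the `docker build`/`docker run` flags above (same port mapping, same env file), not the repo's actual compose file:

```yaml
# docker-compose.yml sketch -- service name and layout are illustrative.
services:
  analytics_agent:
    build: .            # uses the Dockerfile in this directory
    env_file: .env      # same credentials file as the manual run
    ports:
      - "3050:3050"     # matches the -p 3050:3050 mapping above
```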