Web Scraper CLI

A FastAPI-based web scraper with a command-line interface.

Features

  • REST API for web scraping management
  • CLI tool for scraping websites
  • Extract metadata, links, and specific content using CSS selectors
  • Store scraping results in SQLite database
  • Background job processing
  • Rate limiting to avoid overloading target websites
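The rate limiting mentioned above can be as simple as enforcing a minimum interval between requests to the same host. The sketch below is illustrative only, not the project's actual implementation:

```python
import time


class RateLimiter:
    """Enforce a minimum delay between successive requests to each host."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last_request: dict[str, float] = {}

    def wait(self, host: str) -> None:
        """Block until at least min_interval has passed since the last request to host."""
        last = self._last_request.get(host)
        if last is not None:
            elapsed = time.monotonic() - last
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self._last_request[host] = time.monotonic()
```

A scraper would call `limiter.wait(host)` before each HTTP request; requests to different hosts are not delayed by each other.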

Installation

Local Installation

  1. Clone the repository:
git clone https://github.com/yourusername/webscrapercli.git
cd webscrapercli
  2. Install dependencies:
pip install -r requirements.txt
  3. Run the database migrations:
alembic upgrade head

Docker Installation

  1. Clone the repository:
git clone https://github.com/yourusername/webscrapercli.git
cd webscrapercli
  2. Build and run using Docker Compose:
docker-compose up --build

This will:

  • Build the Docker image with all dependencies
  • Start the FastAPI server on port 8000
  • Mount the app and storage directories as volumes for live code reloading

Usage

API Server

Start the API server:

# Development mode
uvicorn main:app --reload

# Production mode
uvicorn main:app --host 0.0.0.0 --port 8000

Access the API documentation at: http://localhost:8000/docs

CLI Usage

The CLI provides several commands for scraping websites:

# Scrape a URL
python cli.py scrape https://example.com

# Scrape a URL with a specific selector
python cli.py scrape https://example.com --selector "div.content"

# Save the results to a file
python cli.py scrape https://example.com --output results.json

# List all scrape jobs
python cli.py list

# List scrape jobs with a specific status
python cli.py list --status completed

# Show details of a specific job
python cli.py show 1

# Run a specific job
python cli.py run 1

API Endpoints

  • GET /health: Health check endpoint
  • POST /api/v1/scrape-jobs/: Create a new scrape job
  • GET /api/v1/scrape-jobs/: List scrape jobs
  • GET /api/v1/scrape-jobs/{job_id}: Get a specific scrape job
  • PUT /api/v1/scrape-jobs/{job_id}: Update a scrape job
  • DELETE /api/v1/scrape-jobs/{job_id}: Delete a scrape job
  • POST /api/v1/scrape-jobs/{job_id}/run: Run a scrape job
  • GET /api/v1/scrape-jobs/{job_id}/results: Get the results of a scrape job

Development

Project Structure

webscrapercli/
├── alembic.ini                  # Alembic configuration
├── app/                         # Application package
│   ├── api/                     # API endpoints
│   ├── cli/                     # CLI implementation
│   ├── core/                    # Core functionality
│   ├── crud/                    # CRUD operations
│   ├── db/                      # Database configuration
│   ├── models/                  # SQLAlchemy models
│   ├── schemas/                 # Pydantic schemas
│   ├── services/                # Business logic
│   └── utils/                   # Utility functions
├── cli.py                       # CLI entry point
├── docker-compose.yml           # Docker Compose configuration
├── Dockerfile                   # Docker configuration
├── main.py                      # API entry point
├── migrations/                  # Alembic migrations
│   ├── env.py                   # Alembic environment
│   ├── script.py.mako           # Alembic script template
│   └── versions/                # Migration scripts
├── requirements.txt             # Dependencies
└── storage/                     # Storage directory for database and other files
    └── db/                      # Database directory

Running Tests

# Run tests
pytest
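pytest discovers any function whose name starts with test_. A minimal example of the shape such tests take, using a hypothetical URL-validation helper rather than the project's actual code:

```python
from urllib.parse import urlparse


def is_valid_scrape_target(url: str) -> bool:
    """Hypothetical validator: accept only absolute http(s) URLs."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def test_accepts_https_url():
    assert is_valid_scrape_target("https://example.com")


def test_rejects_relative_path():
    assert not is_valid_scrape_target("/just/a/path")
```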

License

This project is open source.