Web Scraper CLI

A FastAPI-based web scraper with a command-line interface.

Features

  • REST API for web scraping management
  • CLI tool for scraping websites
  • Extract metadata, links, and specific content using CSS selectors
  • Store scraping results in SQLite database
  • Background job processing
  • Rate limiting to avoid overloading target websites
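The rate limiting mentioned above can be as simple as enforcing a minimum interval between requests to the same host. The sketch below is illustrative only, not the project's actual implementation:

```python
import time


class RateLimiter:
    """Enforce a minimum delay between successive requests to each host."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last_request: dict[str, float] = {}

    def wait(self, host: str) -> None:
        """Block until at least min_interval has passed since the last request to host."""
        last = self._last_request.get(host)
        if last is not None:
            elapsed = time.monotonic() - last
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self._last_request[host] = time.monotonic()
```

A scraper would call `limiter.wait(host)` before each HTTP request; requests to different hosts are not delayed by each other.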

Installation

Local Installation

  1. Clone the repository:
git clone https://github.com/yourusername/webscrapercli.git
cd webscrapercli
  2. Install dependencies:
pip install -r requirements.txt
  3. Run the database migrations:
alembic upgrade head

Docker Installation

  1. Clone the repository:
git clone https://github.com/yourusername/webscrapercli.git
cd webscrapercli
  2. Build and run using Docker Compose:
docker-compose up --build

This will:

  • Build the Docker image with all dependencies
  • Start the FastAPI server on port 8000
  • Mount the app and storage directories as volumes for live code reloading

Usage

API Server

Start the API server:

# Development mode
uvicorn main:app --reload

# Production mode
uvicorn main:app --host 0.0.0.0 --port 8000

Access the API documentation at: http://localhost:8000/docs

CLI Usage

The CLI provides several commands for scraping websites:

# Scrape a URL
python cli.py scrape https://example.com

# Scrape a URL with a specific selector
python cli.py scrape https://example.com --selector "div.content"

# Save the results to a file
python cli.py scrape https://example.com --output results.json

# List all scrape jobs
python cli.py list

# List scrape jobs with a specific status
python cli.py list --status completed

# Show details of a specific job
python cli.py show 1

# Run a specific job
python cli.py run 1

API Endpoints

  • GET /health: Health check endpoint
  • POST /api/v1/scrape-jobs/: Create a new scrape job
  • GET /api/v1/scrape-jobs/: List scrape jobs
  • GET /api/v1/scrape-jobs/{job_id}: Get a specific scrape job
  • PUT /api/v1/scrape-jobs/{job_id}: Update a scrape job
  • DELETE /api/v1/scrape-jobs/{job_id}: Delete a scrape job
  • POST /api/v1/scrape-jobs/{job_id}/run: Run a scrape job
  • GET /api/v1/scrape-jobs/{job_id}/results: Get the results of a scrape job

Development

Project Structure

webscrapercli/
├── alembic.ini                  # Alembic configuration
├── app/                         # Application package
│   ├── api/                     # API endpoints
│   ├── cli/                     # CLI implementation
│   ├── core/                    # Core functionality
│   ├── crud/                    # CRUD operations
│   ├── db/                      # Database configuration
│   ├── models/                  # SQLAlchemy models
│   ├── schemas/                 # Pydantic schemas
│   ├── services/                # Business logic
│   └── utils/                   # Utility functions
├── cli.py                       # CLI entry point
├── docker-compose.yml           # Docker Compose configuration
├── Dockerfile                   # Docker configuration
├── main.py                      # API entry point
├── migrations/                  # Alembic migrations
│   ├── env.py                   # Alembic environment
│   ├── script.py.mako           # Alembic script template
│   └── versions/                # Migration scripts
├── requirements.txt             # Dependencies
└── storage/                     # Storage directory for database and other files
    └── db/                      # Database directory

Running Tests

# Run tests
pytest
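pytest discovers any function whose name starts with test_. A minimal example of the shape such tests take, using a hypothetical URL-validation helper rather than the project's actual code:

```python
from urllib.parse import urlparse


def is_valid_scrape_target(url: str) -> bool:
    """Hypothetical validator: accept only absolute http(s) URLs."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def test_accepts_https_url():
    assert is_valid_scrape_target("https://example.com")


def test_rejects_relative_path():
    assert not is_valid_scrape_target("/just/a/path")
```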

License

This project is open source.