# Web Scraper CLI

A FastAPI-based web scraper with a CLI interface.

## Features

- REST API for managing web scraping jobs
- CLI tool for scraping websites
- Extract metadata, links, and specific content using CSS selectors
- Store scraping results in a SQLite database
- Background job processing
- Rate limiting to avoid overloading target websites

## Installation
### Local Installation

1. Clone the repository:

```bash
git clone https://github.com/yourusername/webscrapercli.git
cd webscrapercli
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Run the database migrations:

```bash
alembic upgrade head
```
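
If you change the SQLAlchemy models later, a new migration can be generated with the standard Alembic workflow. This is a sketch: it assumes the Alembic environment in `migrations/env.py` is set up for autogeneration, and the revision message is only an example.

```bash
# Generate a migration from model changes (message is an example)
alembic revision --autogenerate -m "describe your schema change"

# Apply the new migration
alembic upgrade head
```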
### Docker Installation

1. Clone the repository:

```bash
git clone https://github.com/yourusername/webscrapercli.git
cd webscrapercli
```

2. Build and run using Docker Compose:

```bash
docker-compose up --build
```

This will:

- Build the Docker image with all dependencies
- Start the FastAPI server on port 8000
- Mount the app and storage directories as volumes for live code reloading
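
A few Compose commands cover most day-to-day tasks. The snippet below is a sketch: the service name `app` is an assumption, so check `docker-compose.yml` for the actual name before using the `exec` form.

```bash
# Start the stack in the background
docker-compose up -d --build

# Follow the API server logs
docker-compose logs -f

# Run the CLI inside the container (service name "app" is an assumption)
docker-compose exec app python cli.py scrape https://example.com

# Stop and remove the containers
docker-compose down
```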
## Usage

### API Server

Start the API server:

```bash
# Development mode
uvicorn main:app --reload

# Production mode
uvicorn main:app --host 0.0.0.0 --port 8000
```

Access the API documentation at: http://localhost:8000/docs
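
To confirm the server is running, the documented health-check endpoint can be queried with curl:

```bash
# A successful response means the API is up
curl http://localhost:8000/health
```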
### CLI Usage

The CLI provides several commands for scraping websites:

```bash
# Scrape a URL
python cli.py scrape https://example.com

# Scrape a URL with a specific selector
python cli.py scrape https://example.com --selector "div.content"

# Save the results to a file
python cli.py scrape https://example.com --output results.json

# List all scrape jobs
python cli.py list

# List scrape jobs with a specific status
python cli.py list --status completed

# Show details of a specific job
python cli.py show 1

# Run a specific job
python cli.py run 1
```
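
Several URLs can be scraped in one pass by driving the scrape command from a small shell loop. This sketch uses only the commands shown above; the URLs and output file naming are placeholders.

```bash
# Scrape a list of URLs, saving each result to its own JSON file
for url in https://example.com https://example.org; do
  name="${url#*://}"         # strip the scheme
  name="${name//[\/.]/_}"    # turn dots and slashes into underscores
  python cli.py scrape "$url" --output "${name}.json"
done
```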
## API Endpoints

- `GET /health`: Health check endpoint
- `POST /api/v1/scrape-jobs/`: Create a new scrape job
- `GET /api/v1/scrape-jobs/`: List scrape jobs
- `GET /api/v1/scrape-jobs/{job_id}`: Get a specific scrape job
- `PUT /api/v1/scrape-jobs/{job_id}`: Update a scrape job
- `DELETE /api/v1/scrape-jobs/{job_id}`: Delete a scrape job
- `POST /api/v1/scrape-jobs/{job_id}/run`: Run a scrape job
- `GET /api/v1/scrape-jobs/{job_id}/results`: Get the results of a scrape job
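
A typical request sequence with curl is sketched below. The JSON field names (`url`, `selector`) are assumptions based on the CLI options; check the OpenAPI schema at http://localhost:8000/docs for the actual request body.

```bash
# Create a scrape job (field names are assumptions; see /docs for the real schema)
curl -X POST http://localhost:8000/api/v1/scrape-jobs/ \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "selector": "div.content"}'

# Run job 1, then fetch its results
curl -X POST http://localhost:8000/api/v1/scrape-jobs/1/run
curl http://localhost:8000/api/v1/scrape-jobs/1/results
```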
## Development

### Project Structure

```
webscrapercli/
├── alembic.ini            # Alembic configuration
├── app/                   # Application package
│   ├── api/               # API endpoints
│   ├── cli/               # CLI implementation
│   ├── core/              # Core functionality
│   ├── crud/              # CRUD operations
│   ├── db/                # Database configuration
│   ├── models/            # SQLAlchemy models
│   ├── schemas/           # Pydantic schemas
│   ├── services/          # Business logic
│   └── utils/             # Utility functions
├── cli.py                 # CLI entry point
├── docker-compose.yml     # Docker Compose configuration
├── Dockerfile             # Docker configuration
├── main.py                # API entry point
├── migrations/            # Alembic migrations
│   ├── env.py             # Alembic environment
│   ├── script.py.mako     # Alembic script template
│   └── versions/          # Migration scripts
├── requirements.txt       # Dependencies
└── storage/               # Storage directory for the database and other files
    └── db/                # Database directory
```
### Running Tests

```bash
# Run tests
pytest
```
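
Standard pytest options apply as usual; for example (generic pytest flags, nothing project-specific):

```bash
# Verbose output, stop at the first failure
pytest -v -x

# Run only tests whose names match a keyword
pytest -k scrape
```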
## License

This project is open source.