# Web Scraping Datasource

> The WebScraping datasource is a powerful tool for automatically extracting content from websites and importing it into your workspace.

<Tip>
  This guide walks you through configuring the WebScraping datasource to collect data from websites and import it into your workspace.
</Tip>

## Getting Started

### Creating a New WebScraper Import

To create a new import:

1. Navigate to the Datasources section
2. Click "Add New Datasource" or select an existing WebScraper datasource
3. Click "+ New Import" to configure a new import

## Basic Configuration

### URL Configuration

* **Start URL**: The specific URL where crawling will begin. This is the entry point for the scraper.
  * Example: `https://en.wikipedia.org/wiki/Artificial_intelligence`

### Crawling Parameters

* **Max Crawl Depth**: Controls how deep the crawler will navigate from the starting URL.
  * **0**: Only crawls the starting URL
  * **1**: Includes pages directly linked from the starting URL
  * **2**: Also includes pages linked from those depth-1 pages
  * **3**: Goes three levels deep from the starting URL (maximum)
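The depth levels above can be illustrated with a small breadth-first traversal. This is a minimal sketch, not the datasource's actual crawler: the `crawl_depths` function and the in-memory `site` link graph are hypothetical stand-ins for real pages and their links.

```python
from collections import deque

def crawl_depths(links, start_url, max_depth):
    """Return {url: depth} for every page reachable within max_depth hops."""
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depths[url] == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for target in links.get(url, []):
            if target not in depths:  # visit each page only once
                depths[target] = depths[url] + 1
                queue.append(target)
    return depths

# Hypothetical link graph: each key lists the pages it links to.
site = {
    "/start": ["/a", "/b"],
    "/a": ["/c"],
    "/c": ["/d"],
}

print(crawl_depths(site, "/start", 0))  # only the starting URL
print(crawl_depths(site, "/start", 1))  # adds /a and /b
```

With this graph, raising `max_depth` from 1 to 2 adds `/c`, and raising it to 3 adds `/d`, which shows how quickly the page count can grow with each extra level.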

## Advanced Configuration

* **Max Pages**: Limits the total number of pages crawled.
  * Enable "Limit Max Pages" to set a specific limit
  * Recommended for large websites to prevent excessive crawling

### Content Relevance

* **Relevance Keywords**: Keywords that determine which pages are more important to crawl.
  * Pages containing these keywords receive higher priority
  * Separate multiple keywords with commas
  * Example: `AI, machine learning, neural networks`
* **Keywords Weight**: How strongly to prioritize pages with keywords.
  * **0.0**: Ignore keywords completely
  * **0.7** (default): Balance keyword matching with other factors
  * **1.0**: Prioritize keywords above all other factors
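One way to picture how a weight between 0.0 and 1.0 could trade keyword matches against other factors is a linear blend. The formula below is an assumption made for illustration; the datasource's real scoring is not documented here, and `keyword_score`, `priority`, and `other_score` are hypothetical names.

```python
def keyword_score(text, keywords):
    """Fraction of comma-separated keywords found in the text.

    Uses naive case-insensitive substring matching, so short keywords
    can also match inside longer words.
    """
    lowered = text.lower()
    terms = [k.strip().lower() for k in keywords.split(",") if k.strip()]
    if not terms:
        return 0.0
    return sum(term in lowered for term in terms) / len(terms)

def priority(text, keywords, other_score, weight=0.7):
    """Assumed linear blend of keyword relevance and another score in [0, 1]."""
    return weight * keyword_score(text, keywords) + (1.0 - weight) * other_score

page = "An overview of AI and neural networks"
keywords = "AI, machine learning, neural networks"
print(priority(page, keywords, other_score=0.4))              # default weight 0.7
print(priority(page, keywords, other_score=0.4, weight=0.0))  # keywords ignored
```

Under this assumed model, a weight of 0.0 returns `other_score` unchanged and a weight of 1.0 returns the keyword score alone, matching the two extremes described above.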

### URL Patterns

* **URL Patterns to Include**: Restricts which URLs will be crawled based on patterns.
  * Use `*` as a wildcard
  * Example: `/products/*` matches all pages in the products directory
  * Use `*` alone or leave empty to include all URLs
  * Separate multiple patterns with commas
* **URL Patterns to Exclude**: Specify URL patterns that should NOT be crawled.
  * Example: `/admin/*, /login/` excludes all admin pages and the login page
  * Separate multiple patterns with commas
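To test patterns before running an import, the wildcard matching can be approximated locally with Python's `fnmatch` module. This sketch rests on several assumptions that the datasource may not share: patterns are matched against the URL path only, exclusions take precedence over inclusions, and the helper names `parse_patterns` and `is_allowed` are hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

def parse_patterns(raw):
    """Split a comma-separated pattern string into a clean list."""
    return [p.strip() for p in raw.split(",") if p.strip()]

def is_allowed(url, include="", exclude=""):
    """Return True if the URL path passes the include and exclude patterns."""
    path = urlparse(url).path
    includes = parse_patterns(include) or ["*"]  # empty include means "all URLs"
    excludes = parse_patterns(exclude)
    if any(fnmatchcase(path, p) for p in excludes):
        return False  # assumption: exclusions win over inclusions
    return any(fnmatchcase(path, p) for p in includes)

print(is_allowed("https://example.com/products/shoes", include="/products/*"))
print(is_allowed("https://example.com/admin/users", exclude="/admin/*, /login/"))
```

Note that in this sketch `/login/` matches only that exact path, while `/admin/*` matches everything beneath `/admin/`.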

### Content Selection

* **Content CSS Selector**: CSS selector that defines which content to extract from pages.
  * This limits the scope of both crawling and content extraction: any content outside these selectors is ignored.
  * Example: `article.content,.main,.data-container`
* **Elements to Exclude**: CSS selector for elements to remove from processing.
  * This works like the Content CSS Selector in reverse: the specified elements are excluded from both markdown generation and crawling.
  * Example: `#ads, .cookies` removes ad and cookie elements
* **Target Elements**: CSS selectors for specific content extraction.
  * These elements will be used for markdown generation while still allowing the crawler to process all page links and media.
  * Example: `article.content,.main,.data-container`
* **Tags to Exclude**: HTML tags to skip during content extraction.
  * These tags will be ignored during markdown generation but still checked for crawlable links.
  * Example: `nav`
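The distinction drawn for **Tags to Exclude**, where a tag's text is dropped but its links are still followed, can be sketched with the standard library's `HTMLParser`. This is an illustrative approximation, not the datasource's extraction code; the `Extractor` class and `extract` function are hypothetical, and the sketch handles only paired tags such as `nav`, not CSS selectors.

```python
from html.parser import HTMLParser

class Extractor(HTMLParser):
    """Collect text outside excluded tags, but collect links everywhere."""

    def __init__(self, excluded_tags):
        super().__init__()
        self.excluded = set(excluded_tags)
        self.skip_depth = 0  # > 0 while inside an excluded tag
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)  # links are kept even inside excluded tags
        if tag in self.excluded:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.excluded and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.text.append(data.strip())

def extract(html, excluded_tags=("nav",)):
    """Return (text, links) for an HTML fragment."""
    parser = Extractor(excluded_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text), parser.links

sample = (
    '<nav><a href="/home">Home</a></nav>'
    '<article><p>Body text</p><a href="/next">Next</a></article>'
)
print(extract(sample))
```

Here the `nav` text "Home" is left out of the extracted content, while `/home` still appears in the list of links available to the crawler.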

### Proxy Settings

* **Enable Proxy**: Toggle to use a proxy server for web scraping requests
  * When enabled, additional proxy configuration fields will appear

## Import Settings

* **Workspace**: Select the workspace where the scraped content will be imported
* **Frequency (minutes)**: Set how often the scraper should run
  * Set to 0 for manual triggering only

## Best Practices

1. **Start Small**: Begin with a shallow crawl depth and limited pages to test
2. **Refine Gradually**: Expand your configuration after confirming initial results
3. **Use Content Selection**: Apply HTML and CSS selectors to specify which content to extract and process from pages
4. **Use Relevance Keywords**: For large sites, use keywords to prioritize content
5. **Respect Website Rules**: Avoid aggressive crawling that might overload sites
6. **Check Results**: Regularly review imported content to ensure quality

## Troubleshooting

* **Empty Results**: Check URL patterns and content selectors
* **Too Much Content**: Reduce max depth or pages, or add selection/exclusion patterns
* **Irrelevant Content**: Refine CSS selectors to target specific content areas
* **Import Failures**: Check the site's robots.txt rules or try using a proxy
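When investigating import failures, a site's robots.txt rules can be checked locally with Python's `urllib.robotparser`. The sketch below parses rules from a string so that it runs offline; the sample rules and the `allowed_by_robots` helper are hypothetical, and the user agent the scraper actually sends is not specified here, so `*` is used as an assumption.

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt content permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical rules; paste the target site's real robots.txt content instead.
rules = """\
User-agent: *
Disallow: /admin/
"""

print(allowed_by_robots(rules, "https://example.com/admin/users"))
print(allowed_by_robots(rules, "https://example.com/wiki/Artificial_intelligence"))
```

If the start URL or most linked pages are disallowed, that may explain an empty or failed import.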

## Example Configuration

For crawling Wikipedia articles about AI:

* **Start URL**: `https://en.wikipedia.org/wiki/Artificial_intelligence`
* **Max Crawl Depth**: 1
* **Max Pages**: 20
* **Relevance Keywords**: `machine learning, neural network, deep learning`
* **Keywords Weight**: 0.7
* **Content CSS Selector**: `main`
* **Elements to Exclude**: `.sidebar,.vector-column-end,.vector-page-toolbar,.vector-body-before-content,.navigation-not-searchable`
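For teams that keep import settings under version control, the example above could be recorded as a small validated structure. This is only a sketch: the `ImportConfig` class and its field names are hypothetical and do not correspond to any documented API. The validation mirrors the limits stated in this guide (depth from 0 to 3, weight from 0.0 to 1.0).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ImportConfig:
    """Hypothetical record of the settings entered in the import form."""
    start_url: str
    max_crawl_depth: int
    max_pages: Optional[int] = None  # None means "Limit Max Pages" is disabled
    relevance_keywords: str = ""
    keywords_weight: float = 0.7
    content_css_selector: str = ""
    elements_to_exclude: str = ""

    def __post_init__(self):
        if not 0 <= self.max_crawl_depth <= 3:
            raise ValueError("max_crawl_depth must be between 0 and 3")
        if not 0.0 <= self.keywords_weight <= 1.0:
            raise ValueError("keywords_weight must be between 0.0 and 1.0")
        if self.max_pages is not None and self.max_pages < 1:
            raise ValueError("max_pages must be at least 1 when set")

wikipedia_ai = ImportConfig(
    start_url="https://en.wikipedia.org/wiki/Artificial_intelligence",
    max_crawl_depth=1,
    max_pages=20,
    relevance_keywords="machine learning, neural network, deep learning",
    keywords_weight=0.7,
    content_css_selector="main",
    elements_to_exclude=(
        ".sidebar,.vector-column-end,.vector-page-toolbar,"
        ".vector-body-before-content,.navigation-not-searchable"
    ),
)
print(wikipedia_ai.max_crawl_depth, wikipedia_ai.max_pages)
```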
