Documentation Index
Fetch the complete documentation index at: https://docs.lighton.ai/llms.txt
Use this file to discover all available pages before exploring further.
Getting Started
1. Creating a New WebScraper Import
To create a new import:
- Navigate to the Datasources section
- Click “Add New Datasource” or select an existing WebScraper datasource
- Click “+ New Import” to configure a new import
Basic Configuration
URL Configuration
- Start URL: The specific URL where crawling will begin. This is the entry point for the scraper.
- Example: https://en.wikipedia.org/wiki/Artificial_intelligence
Crawling Parameters
- Max Crawl Depth: Controls how deep the crawler will navigate from the starting URL.
- 0: Only crawls the starting URL
- 1: Includes pages directly linked from the starting URL
- 2: Includes links from those direct links
- 3: Goes three levels deep (maximum)
Advanced Configuration
- Max Pages: Limits the total number of pages crawled.
- Enable “Limit Max Pages” to set a specific limit
- Recommended for large websites to prevent excessive crawling
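Together, Max Crawl Depth and Max Pages bound how far the crawl spreads. The depth semantics (0 = start URL only, 1 = direct links, and so on) and the page cap can be pictured with a minimal breadth-first sketch over an in-memory link graph. This is illustrative only — the function and variable names are not part of the product, and a real crawler fetches pages and extracts links instead of reading a dictionary:

```python
from collections import deque

def crawl(links, start_url, max_depth, max_pages=None):
    """Collect pages reachable from start_url within max_depth hops.

    `links` maps each URL to the URLs it links to; this stands in
    for fetching a page and extracting its links.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = []
    while queue:
        url, depth = queue.popleft()
        pages.append(url)
        if max_pages is not None and len(pages) >= max_pages:
            break  # "Limit Max Pages" cap reached
        if depth == max_depth:
            continue  # do not follow links past the configured depth
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return pages

links = {"/start": ["/a", "/b"], "/a": ["/c"]}
print(crawl(links, "/start", 0))  # ['/start']
print(crawl(links, "/start", 1))  # ['/start', '/a', '/b']
print(crawl(links, "/start", 2))  # ['/start', '/a', '/b', '/c']
```

Raising the depth by one adds one more "ring" of linked pages, which is why page counts can grow quickly on link-dense sites.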
Content Relevance
- Relevance Keywords: Keywords that determine which pages are more important to crawl.
- Pages containing these keywords receive higher priority
- Separate multiple keywords with commas
- Example: AI, machine learning, neural networks
- Keywords Weight: How strongly to prioritize pages with keywords.
- 0.0: Ignore keywords completely
- 1.0: Prioritize keywords above all other factors
- 0.7: (Default) Balances keyword matching with other factors
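One way to picture how Keywords Weight blends keyword matching with other ranking factors is a weighted average. The formula below is a hypothetical illustration of the 0.0/0.7/1.0 semantics described above, not the scraper's documented scoring logic:

```python
def priority(text, keywords, weight=0.7, base_score=0.5):
    """Blend keyword relevance with other crawl signals.

    weight=0.0 ignores keywords entirely; weight=1.0 ranks by
    keywords alone. `base_score` is a stand-in for the "other
    factors" the documentation mentions (hypothetical).
    """
    text = text.lower()
    matched = sum(1 for kw in keywords if kw.strip().lower() in text)
    keyword_score = matched / len(keywords) if keywords else 0.0
    return weight * keyword_score + (1 - weight) * base_score

kws = ["AI", "machine learning", "neural networks"]
print(priority("Intro to machine learning and neural networks", kws))
```

With the default weight of 0.7, a page matching most keywords outranks a non-matching page, but other signals still contribute 30% of the score.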
URL Patterns
- URL Patterns to Include: Restricts which URLs will be crawled based on patterns.
- Use * as a wildcard
- Example: /products/* matches all pages in the products directory
- Use * alone or leave empty to include all URLs
- Separate multiple patterns with commas
- URL Patterns to Exclude: Specify URL patterns that should NOT be crawled.
- Example: /admin/*, /login/ excludes admin pages and the login page
- Separate multiple patterns with commas
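The wildcard behavior described above resembles shell-style globbing. A sketch of include/exclude filtering using Python's standard fnmatch module — illustrative only, since the scraper's exact matching rules (e.g. whether patterns apply to paths or full URLs) are not specified here:

```python
from fnmatch import fnmatch

def allowed(path, include="*", exclude=""):
    """Return True if path passes the comma-separated pattern lists.

    Exclude patterns win over include patterns, mirroring the
    documented behavior of the two fields.
    """
    includes = [p.strip() for p in include.split(",") if p.strip()] or ["*"]
    excludes = [p.strip() for p in exclude.split(",") if p.strip()]
    if any(fnmatch(path, p) for p in excludes):
        return False
    return any(fnmatch(path, p) for p in includes)

print(allowed("/products/shoes", include="/products/*"))     # True
print(allowed("/admin/users", exclude="/admin/*, /login/"))  # False
print(allowed("/login/", exclude="/admin/*, /login/"))       # False
```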
Content Selection
- Content CSS Selector: CSS selector that defines which content to extract from pages.
- This limits both crawling and content extraction scope—any content outside these selectors will be ignored.
- Example: article.content, .main, .data-container
- Elements to Exclude: CSS selector for elements to remove from processing.
- This works like the Content CSS Selector but in reverse—specified elements will be excluded from both markdown generation and crawling.
- Example: #ads, .cookies to remove ad and cookie-banner elements
- Target Elements: CSS selectors for specific content extraction.
- These elements will be used for markdown generation while still allowing the crawler to process all page links and media.
- Example: article.content, .main, .data-container
- Tags to Exclude: HTML tags to skip during content extraction.
- These tags will be ignored during markdown generation but still checked for crawlable links.
- Example: nav
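The effect of excluding tags like nav from markdown generation can be sketched with Python's standard-library HTMLParser. This is a simplified illustration — the real extractor also handles CSS selectors and, as noted above, still scans excluded tags for crawlable links:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping content inside excluded tags."""

    def __init__(self, exclude_tags=("nav", "script", "style")):
        super().__init__()
        self.exclude_tags = set(exclude_tags)
        self.skip_depth = 0  # >0 while inside an excluded tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.exclude_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.exclude_tags and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html, exclude_tags=("nav",)):
    parser = TextExtractor(exclude_tags)
    parser.feed(html)
    return " ".join(parser.chunks)

html = "<nav><a href='/'>Home</a></nav><article>Main content</article>"
print(extract_text(html))  # Main content
```

Note that in this sketch the nav link's text is dropped from the output, while a real crawler would still follow the href it contains.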
Proxy Settings
- Enable Proxy: Toggle to use a proxy server for web scraping requests
- When enabled, additional proxy configuration fields will appear
Import Settings
- Workspace: Select the workspace where the scraped content will be imported
- Frequency (minutes): Set how often the scraper should run
- Set to 0 for manual triggering only
Best Practices
- Start Small: Begin with a shallow crawl depth and limited pages to test
- Refine Gradually: Expand your configuration after confirming initial results
- Use Content Selection: Apply HTML and CSS selectors to specify which content to extract and process from pages
- Use Relevance Keywords: For large sites, use keywords to prioritize content
- Respect Website Rules: Avoid aggressive crawling that might overload sites
- Check Results: Regularly review imported content to ensure quality
Troubleshooting
- Empty Results: Check URL patterns and content selectors
- Too Much Content: Reduce max depth or pages, or add selection/exclusion patterns
- Irrelevant Content: Refine CSS selectors to target specific content areas
- Import Failures: Check the site’s robots.txt rules or try using a proxy
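For the robots.txt check mentioned above, Python's standard urllib.robotparser can verify whether a path is allowed before you configure an import. The rules below are a made-up example, not any real site's policy — fetch the target site's actual /robots.txt for a real check:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt rules (hypothetical site policy).
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /login/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/wiki/AI"))  # True
print(rp.can_fetch("*", "https://example.com/admin/"))   # False
```

If key start URLs are disallowed, the site's rules rather than your configuration are likely the cause of empty results or import failures.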
Example Configuration
For crawling Wikipedia articles about AI:
- Start URL: https://en.wikipedia.org/wiki/Artificial_intelligence
- Max Crawl Depth: 1
- Max Pages: 20
- Relevance Keywords: machine learning, neural network, deep learning
- Keywords Weight: 0.7
- Content CSS Selector: main
- Elements to Exclude: .sidebar, .vector-column-end, .vector-page-toolbar, .vector-body-before-content, .navigation-not-searchable