Introducing Kandi: Your Personal Pocket Web Crawler And Search Engine
Are you tired of sifting through countless websites to find the content you care about?
Meet Kandi, your personal “pocket” web crawler and search engine, designed to make your online experience smoother and more efficient.
Kandi Versions:
- Kandi 2.x: PHP Web Crawler with MySQL Database and Search Engine
Applications
- Website Analysis:
  - SEO Audits: The crawler can audit a website for SEO purposes by gathering all internal links, helping to identify broken links or paths that need attention.
  - Site Structure Review: It helps in understanding the structure of a website, including how pages are interlinked.
- Link Extraction:
  - Content Aggregation: Useful for collecting all external and internal links for content aggregation, or to build a comprehensive list of resources related to a topic.
  - Competitor Analysis: By crawling competitor websites, you can gather data on their link structures, aiding competitive analysis.
- Web Scraping:
  - Data Collection: Gathers URLs from a site to feed further scraping processes or to populate databases with links for specific categories or tags.
- Testing and Monitoring:
  - Website Testing: Ensures that all expected links are reachable and that no pages are missing or incorrectly linked.
  - Monitoring: Regularly checks website links for changes or errors over time.
How It Functions
1. Initialization and Form Handling
- Form Submission: The script starts by handling form submissions where users input the URL of the website they want to crawl.
- URL Validation: It validates the provided URL to ensure it is correctly formatted before initiating the crawl (a minimal sketch of this step follows).
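The form-handling step might look roughly like the sketch below, assuming a single input field named `url`; the field name and error message are illustrative, not taken from Kandi's actual code.

```php
<?php
// Minimal sketch of form handling and URL validation (field name "url" is assumed).
$start_url = '';
$error     = '';

if ($_SERVER['REQUEST_METHOD'] === 'POST' && isset($_POST['url'])) {
    $start_url = trim($_POST['url']);

    // Reject anything that is not a syntactically valid absolute URL.
    if (filter_var($start_url, FILTER_VALIDATE_URL) === false) {
        $error = 'Please enter a valid URL, including the http:// or https:// scheme.';
    }
}
```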
2. Timing Measurement
- Start Time: Records the time when the crawling process begins using `microtime(true)`.
- End Time: Records the time when the crawling process finishes.
- Elapsed Time Calculation: Computes the duration of the crawl by subtracting the start time from the end time; the elapsed time is formatted to two decimal places for readability (see the sketch below).
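In code, the timing bookkeeping reduces to a few lines, roughly as follows (variable names are illustrative):

```php
<?php
$start_time = microtime(true);          // timestamp before the crawl begins

// ... crawl_website($start_url, $max_depth) runs here ...

$end_time     = microtime(true);        // timestamp after the crawl finishes
$elapsed_time = number_format($end_time - $start_time, 2);  // e.g. "3.47" seconds
```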
3. Crawling Process
- Initialization:
  - Queue of URLs: Begins with the start URL and initializes a queue (`$to_visit`) with this URL to track pages yet to be visited.
  - Visited URLs: Maintains a list of URLs that have already been visited (`$visited`) to avoid redundant processing.
- Crawling Loop (a simplified sketch follows this list):
  - URL Processing: Iterates over the queue, fetching and processing each URL.
  - HTML Content Retrieval: Uses cURL to fetch the HTML content of each URL.
  - Error Handling: Checks for errors during the retrieval process and logs any issues.
  - Link Extraction: Parses the fetched HTML to extract all hyperlinks using `DOMDocument` and `DOMXPath`.
  - URL Resolution: Resolves relative URLs into absolute URLs for consistency.
  - Queue Management: Adds newly discovered links to the queue if they haven't been visited yet and adheres to a maximum depth to prevent excessive crawling.
- Depth Limitation: The script limits the crawl depth (`$max_depth`) to prevent it from endlessly traversing the site or crawling too deeply into the web structure.
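Below is a condensed sketch of the loop described above. It assumes the helpers `get_html_content()` and `get_links()` behave as described in the code breakdown further down; Kandi's actual bookkeeping may differ in detail.

```php
<?php
function crawl_website(string $start_url, int $max_depth = 2): array
{
    $visited  = [];                               // URLs already processed
    $to_visit = [[$start_url, 0]];                // queue of [url, depth] pairs
    $pages    = [];                               // results keyed by URL

    while (!empty($to_visit)) {
        [$url, $depth] = array_shift($to_visit);  // take the next URL off the queue

        if (isset($visited[$url]) || $depth > $max_depth) {
            continue;                             // skip duplicates and over-deep links
        }
        $visited[$url] = true;

        $html = get_html_content($url);           // cURL fetch (see helper sketch below)
        if ($html === false) {
            $pages[$url] = ['status' => 'error'];
            continue;
        }
        $pages[$url] = ['status' => 'ok'];

        // Queue newly discovered, not-yet-visited links one level deeper.
        foreach (get_links($html, $url) as $link) {
            if (!isset($visited[$link])) {
                $to_visit[] = [$link, $depth + 1];
            }
        }
    }

    return $pages;
}
```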
4. Result Compilation
- Results Storage: Collects and stores results of each crawled URL, including any errors encountered.
- Page Counting: Counts the total number of pages crawled by measuring the size of the `$pages` array.
- Message Formation: Constructs a message summarizing the number of pages crawled and the elapsed time (see the sketch below).
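Compiling the summary then reduces to counting the result array and formatting a message, roughly as follows (variable names match those used above):

```php
<?php
$pages      = crawl_website($start_url, $max_depth);  // results from the crawl
$page_count = count($pages);                          // total pages crawled

$message = sprintf('Crawled %d pages in %s seconds.', $page_count, $elapsed_time);
```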
5. Displaying Results
- HTML Output: Displays the results on a web page (a minimal output sketch follows this list), including:
  - Crawling Results: Lists all the crawled URLs and their statuses.
  - Error Messages: Shows any errors encountered during the crawl.
  - Performance Summary: Provides a summary of the total number of pages crawled and the time taken.
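The output stage can be as simple as looping over the results and echoing the summary; the markup below is purely illustrative:

```php
<?php
// Illustrative output only; Kandi's actual markup will differ.
echo '<ul>';
foreach ($pages as $url => $result) {
    echo '<li>' . htmlspecialchars($url) . ': ' . htmlspecialchars($result['status']) . '</li>';
}
echo '</ul>';

echo '<p>' . htmlspecialchars($message) . '</p>';
```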
Code Breakdown
- Functions (hedged sketches of these helpers appear after this list):
  - `get_html_content($url)`: Fetches and returns HTML content from the given URL.
  - `get_links($html, $base_url)`: Extracts and resolves URLs from the HTML content.
  - `resolve_url($relative_url, $base_url)`: Converts relative URLs to absolute URLs based on the base URL.
  - `crawl_website($start_url, $max_depth)`: Main function that performs the crawling, following links up to a specified depth.
- Error Handling: Manages errors such as cURL failures or HTTP errors and reports them appropriately.
- User Interface: Provides a form for users to input the URL and displays the results after the crawl completes, including error messages and the performance summary.
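For completeness, here are hedged sketches of the fetch and link-extraction helpers listed above. The signatures follow the breakdown, but the bodies are simplified approximations rather than Kandi's exact implementation.

```php
<?php
// Fetch a page over HTTP with cURL; returns the HTML string or false on failure.
function get_html_content(string $url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow redirects
        CURLOPT_TIMEOUT        => 10,
        CURLOPT_USERAGENT      => 'KandiCrawler',  // assumed user-agent string
    ]);
    $html = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return ($html !== false && $code < 400) ? $html : false;
}

// Extract all href values from the HTML and resolve them against the base URL.
function get_links(string $html, string $base_url): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);                        // suppress warnings from malformed HTML
    $xpath = new DOMXPath($doc);

    $links = [];
    foreach ($xpath->query('//a[@href]') as $node) {
        $links[] = resolve_url($node->getAttribute('href'), $base_url);
    }

    return array_unique($links);
}

// Convert a relative URL to an absolute one (simplified resolution).
function resolve_url(string $relative_url, string $base_url): string
{
    if (parse_url($relative_url, PHP_URL_SCHEME)) {
        return $relative_url;                      // already absolute
    }

    $parts  = parse_url($base_url);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    if (strpos($relative_url, '/') === 0) {
        return $origin . $relative_url;            // root-relative path
    }

    // Path-relative: append to the directory portion of the base URL.
    $dir = rtrim(dirname($parts['path'] ?? '/'), '/\\');
    return $origin . $dir . '/' . $relative_url;
}
```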