Domain Crawler and Comparison Tool

This repository contains Python scripts to crawl and compare a website for changes.

Overview

  1. capture.py – A web crawler that goes through all the pages of a given domain and exports the URLs, status codes, page sizes, and heights into both .txt and .html formats.
  2. compare.py – A script that takes two .txt files (generated by capture.py), representing the old and new versions of a website, and compares them side by side. It exports the differences to an HTML file, highlighting the discrepancies.
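To illustrate the crawling step described above, here is a minimal sketch of how a crawler might collect the same-domain links on a page before visiting them. It is not taken from capture.py: the names LinkCollector and same_domain_links are hypothetical, and the sketch uses only the Python standard library, which may differ from what the real script depends on.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_url, html):
    """Return the sorted, de-duplicated links in html that stay on page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = set()
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" suffixes.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        # Keep only http(s) links on the same host as the page being crawled.
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            links.add(absolute)
    return sorted(links)
```

A crawler built on this idea would keep a queue of unvisited URLs, fetch each one, record its status code and size, and feed the returned links back into the queue until no new URLs appear.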

Installation

Installing Python

  • Windows: Download the installer from Python’s official site and follow the installation steps. Make sure to check the “Add Python to PATH” checkbox during installation.
  • macOS: Recent macOS releases may not include a ready-to-use Python 3, so check with python3 --version first. If it is missing or outdated, download the latest version from Python’s official site.
  • Linux: Use your distribution’s package manager to install Python. For example, on Ubuntu:

sudo apt-get update
sudo apt-get install python3

Installing Requirements

After installing Python, install the required packages. Navigate to the project folder in your terminal and run:

pip install -r requirements.txt

How to operate

  1. Crawling a Website:
python capture.py

Follow the prompts to enter the website domain and select the type of crawl. The capture output is saved to ./captures/

  2. Comparing Websites:
python compare.py

Follow the prompts to select the .txt files to be compared. The final report will be generated in ./compares/
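The comparison step can be sketched as a set difference over two captures. The example below is only an illustration: the capture file layout is not documented here, so it assumes a hypothetical format of one tab-separated "URL, status code, size" record per line, and the function names parse_capture and compare_captures are invented for this sketch rather than taken from compare.py.

```python
def parse_capture(text):
    """Parse hypothetical 'url<TAB>status<TAB>size' lines into a dict keyed by URL."""
    records = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        url, status, size = line.split("\t")
        records[url] = (int(status), int(size))
    return records


def compare_captures(old, new):
    """Return the URLs that were removed, added, or changed between two captures."""
    old_urls, new_urls = set(old), set(new)
    return {
        "removed": sorted(old_urls - new_urls),
        "added": sorted(new_urls - old_urls),
        # A URL counts as changed when its status code or size differs.
        "changed": sorted(u for u in old_urls & new_urls if old[u] != new[u]),
    }
```

Under these assumptions, a URL listed under "removed" existed on the old site but not the new one, which is the case that matters most when checking a migration.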

Why use this tool?

  • Migrating to a New Platform/Host: Before switching to a new platform or hosting service, you may want to ensure that all URLs from the old platform exist in the new platform and function as expected.
  • Switching WordPress Themes: A change in theme may result in differences in content display, load times, or even broken links. Comparing the website before and after the switch can highlight these issues.
  • SEO Analysis: Ensuring that URLs, especially high-traffic ones, remain consistent during any changes can help preserve SEO rankings.
  • Quality Assurance: Before rolling out a redesigned website, comparing the old and new sites can help identify bugs, missing content, or other issues that need to be addressed.

How to Download

Py Domain Crawler and Comparison Tool is available on GitHub.

https://github.com/gbti-labs/py-domain-crawler-and-comparison-tool