Spider - Web Crawler and Wordlist / Ngram Generator
Author:
cyclone
URL:
https://github.com/cyclone-github/spider
Description:
Spider is a web crawler and wordlist/ngram generator written in Go that crawls specified URLs or local files to produce frequency-sorted wordlists and ngrams. Users can customize crawl depth, output file, frequency sorting, and ngram options, making it well suited to web scraping for targeted wordlists for tools like hashcat or John the Ripper. Spider combines the web-scraping capabilities of CeWL with ngram generation, and since Spider is written in Go, it requires no additional libraries to download or install.
Spider just works.
Spider: URL Mode
spider -url 'https://forum.hashpwn.net' -crawl 2 -delay 20 -sort -ngram 1-3 -timeout 1 -url-match wordlist -o forum.hashpwn.net_spider.txt
 ----------------------
| Cyclone's URL Spider |
 ----------------------
Crawling URL: https://forum.hashpwn.net
Base domain: forum.hashpwn.net
Crawl depth: 2
ngram len: 1-3
Crawl delay: 20ms (increase this to avoid rate limiting)
Timeout: 1 sec
URLs crawled: 2
Processing... [====================] 100.00%
Unique words: 475
Unique ngrams: 1977
Sorting n-grams by frequency...
Writing... [====================] 100.00%
Output file: forum.hashpwn.net_spider.txt
RAM used: 0.02 GB
Runtime: 2.283s
Spider: File Mode
spider -file kjv_bible.txt -sort -ngram 1-3
 ----------------------
| Cyclone's URL Spider |
 ----------------------
Reading file: kjv_bible.txt
ngram len: 1-3
Processing... [====================] 100.00%
Unique words: 35412
Unique ngrams: 877394
Sorting n-grams by frequency...
Writing... [====================] 100.00%
Output file: kjv_bible_spider.txt
RAM used: 0.13 GB
Runtime: 1.359s
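Under the hood, the ngram output shown in both modes reduces to tokenizing text into words, sliding a window of each requested length over the word stream, and counting occurrences. A minimal Go sketch of the idea behind "-ngram 1-3" (the whitespace tokenizer and the countNgrams helper are illustrative assumptions, not Spider's actual code):

package main

import (
	"fmt"
	"strings"
)

// countNgrams is a hypothetical helper: it splits text into words and
// counts every n-gram of length minLen..maxLen, mirroring "-ngram 1-3".
func countNgrams(text string, minLen, maxLen int) map[string]int {
	words := strings.Fields(strings.ToLower(text)) // naive whitespace tokenizer
	counts := make(map[string]int)
	for n := minLen; n <= maxLen; n++ {
		for i := 0; i+n <= len(words); i++ {
			counts[strings.Join(words[i:i+n], " ")]++
		}
	}
	return counts
}

func main() {
	counts := countNgrams("in the beginning was the word", 1, 3)
	fmt.Println(counts["the"])      // 2
	fmt.Println(counts["the word"]) // 1
}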
Wordlist & ngram creation tool that crawls a given URL or processes a local file to create wordlists and/or ngrams (depending on the flags given).
Usage Instructions:
- To create a simple wordlist from a specified URL (will save a deduplicated wordlist to url_spider.txt):
spider -url 'https://github.com/cyclone-github'
- To set a URL crawl depth of 2 and create ngrams of length 1-5, use flags "-crawl 2" and "-ngram 1-5"
spider -url 'https://github.com/cyclone-github' -crawl 2 -ngram 1-5
- To set a custom output file, use flag "-o filename"
spider -url 'https://github.com/cyclone-github' -o wordlist.txt
- To set a delay to keep from being rate-limited, use flag "-delay nth" where nth is time in milliseconds
spider -url 'https://github.com/cyclone-github' -delay 100
- To set a URL timeout, use flag "-timeout nth" where nth is time in seconds (a Go sketch of the delay/timeout pattern appears after the help output below)
spider -url 'https://github.com/cyclone-github' -timeout 2
- To create ngrams len 1-3 and sort output by frequency, use flags "-ngram 1-3" and "-sort"
spider -url 'https://github.com/cyclone-github' -ngram 1-3 -sort
- To filter crawled URLs by keyword "foobar"
spider -url 'https://github.com/cyclone-github' -url-match foobar
- To process a local text file, create ngrams len 1-3 and sort output by frequency
spider -file foobar.txt -ngram 1-3 -sort
- Run "spider -help" to see a list of all options:
spider -help
  -crawl int
        Depth of links to crawl (default 1)
  -cyclone
        Display coded message
  -delay int
        Delay in ms between each URL lookup to avoid rate limiting (default 10)
  -file string
        Path to a local file to scrape
  -url-match string
        Only crawl URLs containing this keyword (case-insensitive)
  -ngram string
        Lengths of n-grams (e.g., "1-3" for 1, 2, and 3-length n-grams). (default "1")
  -o string
        Output file for the n-grams
  -sort
        Sort output by frequency
  -timeout int
        Timeout for URL crawling in seconds (default 1)
  -url string
        URL of the website to scrape
  -version
        Display version
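The "-delay" and "-timeout" flags map naturally onto Go's standard library: a per-request pause between lookups plus a client-level timeout. A hedged sketch of that pattern (fetchAll is an invented name; this is not Spider's actual crawl loop):

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchAll illustrates "-delay" and "-timeout": it sleeps between
// requests to avoid rate limiting and gives up on any URL that takes
// longer than the timeout.
func fetchAll(urls []string, delay, timeout time.Duration) {
	client := &http.Client{Timeout: timeout} // "-timeout" equivalent
	for _, u := range urls {
		resp, err := client.Get(u)
		if err != nil {
			fmt.Println("skip:", u, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s: %d bytes\n", u, len(body))
		time.Sleep(delay) // "-delay" equivalent (milliseconds in Spider's CLI)
	}
}

func main() {
	fetchAll([]string{"https://example.com"}, 20*time.Millisecond, time.Second)
}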
Compile from source:
- If you want the latest features, compiling from source is the best option since the release version may run several revisions behind the source code.
- This assumes you have Go and Git installed
git clone https://github.com/cyclone-github/spider.git  # clone repo
cd spider                                                # enter project directory
go mod init spider                                       # initialize Go module (skips if go.mod exists)
go mod tidy                                              # download dependencies
go build -ldflags="-s -w" .                              # compile binary in current directory
go install -ldflags="-s -w" .                            # compile binary and install to $GOPATH
- Compile from source code how-to:
Changelog:
Mentions:
- Go Package Documentation: https://pkg.go.dev/github.com/cyclone-github/spider
- Softpedia: https://www.softpedia.com/get/Internet/Other-Internet-Related/Cyclone-s-URL-Spider.shtml
v0.8.0
New code pushed to GitHub.
- added flag "-file" to allow creating ngrams from a local plaintext file (ex: foobar.txt)
- added flag "-timeout" for -url mode
- added flag "-sort" which sorts output by frequency
- fixed several small bugs
You can also use the -file and -sort flags together to frequency-sort and dedup wordlists that contain dups, ex:
spider -file foobar.txt -sort
This optimizes wordlists by sorting them by probability, with the most frequently occurring words listed at the top. Keep in mind:
- This only applies to wordlists which contain dups
- Sorting large wordlists is RAM intensive
- This feature is beta, so results may vary
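Frequency sorting is essentially counting occurrences and then ordering the unique words by count, descending; dedup falls out for free because each word is emitted once. A minimal sketch of that approach, assuming a simple count-then-sort design (Spider's internals may differ). The counting map holding every unique word is also why sorting large wordlists is RAM intensive:

package main

import (
	"fmt"
	"sort"
)

// sortByFreq dedups words and orders them most-frequent-first,
// illustrating the idea behind "-sort" (hypothetical helper name).
func sortByFreq(words []string) []string {
	counts := make(map[string]int) // holds every unique word: the RAM cost
	for _, w := range words {
		counts[w]++
	}
	uniq := make([]string, 0, len(counts))
	for w := range counts {
		uniq = append(uniq, w)
	}
	sort.Slice(uniq, func(i, j int) bool {
		return counts[uniq[i]] > counts[uniq[j]] // highest count first
	})
	return uniq
}

func main() {
	fmt.Println(sortByFreq([]string{"password", "admin", "password", "123456", "password"}))
	// [password admin 123456] (order of ties may vary)
}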
New Release: v0.9.0
https://github.com/cyclone-github/spider/releases/tag/v0.9.0
Changelog:
- v0.9.0 by cyclone-github in https://github.com/cyclone-github/spider/pull/7
- added flag "-url-match" to only crawl URLs containing a specified keyword; https://github.com/cyclone-github/spider/issues/6
- added notice to user if no URLs are crawled when using "-crawl 1 -url-match"
- exit early if zero URLs were crawled (no processing or file output)
- use custom User-Agent "Spider/0.9.0 (+https://github.com/cyclone-github/spider)"
- removed clearScreen function and its imports
- fixed crawl-depth calculation logic
- restricted link collection to .html, .htm, .txt and extension-less paths
- upgraded dependencies and bumped Go version to v1.24.3
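The "-url-match" behavior described above amounts to a case-insensitive substring test applied before a link is crawled, and v0.9.0 also sends a custom User-Agent. A hedged sketch of both ideas (matchesKeyword is an invented name, not Spider's actual implementation):

package main

import (
	"fmt"
	"net/http"
	"strings"
)

// matchesKeyword mirrors the described "-url-match" check:
// a case-insensitive substring match on the URL.
func matchesKeyword(url, keyword string) bool {
	return strings.Contains(strings.ToLower(url), strings.ToLower(keyword))
}

func main() {
	fmt.Println(matchesKeyword("https://forum.hashpwn.net/Wordlist-thread", "wordlist")) // true
	fmt.Println(matchesKeyword("https://forum.hashpwn.net/rules", "wordlist"))           // false

	// Setting the custom User-Agent noted in the changelog:
	req, err := http.NewRequest("GET", "https://example.com", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("User-Agent", "Spider/0.9.0 (+https://github.com/cyclone-github/spider)")
	fmt.Println(req.Header.Get("User-Agent"))
}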