Spider - Web Crawler and Wordlist / Ngram Generator

Post #1 by cyclone (Admin), last edited by cyclone

    Author: cyclone
    URL: https://github.com/cyclone-github/spider
    Description:
    Spider is a web crawler and wordlist/ngram generator written in Go that crawls specified URLs or local files to produce frequency-sorted wordlists and ngrams. Users can customize crawl depth, output files, frequency sort, and ngram options, making it ideal for web scraping to create targeted wordlists for tools like hashcat or John the Ripper. Spider combines the web scraping capabilities of CeWL and adds ngram generation, and since Spider is written in Go, it requires no additional libraries to download or install.
    Spider just works.
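
    To make the ngram idea concrete, here is a minimal Go sketch of sliding-window ngram generation. This is illustrative only, not Spider's actual source; the function name and whitespace tokenization are assumptions.

    ```go
    package main

    import (
    	"fmt"
    	"strings"
    )

    // ngrams returns every n-gram of length n from tokens, joined with
    // spaces. Sketch only; Spider's real tokenizer and frequency
    // counting are more involved.
    func ngrams(tokens []string, n int) []string {
    	var out []string
    	for i := 0; i+n <= len(tokens); i++ {
    		out = append(out, strings.Join(tokens[i:i+n], " "))
    	}
    	return out
    }

    func main() {
    	tokens := strings.Fields("the quick brown fox")
    	for n := 1; n <= 3; n++ { // mirrors -ngram 1-3
    		fmt.Println(ngrams(tokens, n))
    	}
    }
    ```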

    Spider: URL Mode

    spider -url 'https://forum.hashpwn.net' -crawl 2 -delay 20 -sort -ngram 1-3 -timeout 1 -url-match wordlist -o forum.hashpwn.net_spider.txt
    
     ---------------------- 
    | Cyclone's URL Spider |
     ---------------------- 
    
    Crawling URL:   https://forum.hashpwn.net
    Base domain:    forum.hashpwn.net
    Crawl depth:    2
    ngram len:      1-3
    Crawl delay:    20ms (increase this to avoid rate limiting)
    Timeout:        1 sec
    URLs crawled:   2
    Processing...   [====================] 100.00%
    Unique words:   475
    Unique ngrams:  1977
    Sorting n-grams by frequency...
    Writing...      [====================] 100.00%
    Output file:    forum.hashpwn.net_spider.txt
    RAM used:       0.02 GB
    Runtime:        2.283s
    

    Spider: File Mode

    spider -file kjv_bible.txt -sort -ngram 1-3
    
     ---------------------- 
    | Cyclone's URL Spider |
     ---------------------- 
    
    Reading file:   kjv_bible.txt
    ngram len:      1-3
    Processing...   [====================] 100.00%
    Unique words:   35412
    Unique ngrams:  877394
    Sorting n-grams by frequency...
    Writing...      [====================] 100.00%
    Output file:    kjv_bible_spider.txt
    RAM used:       0.13 GB
    Runtime:        1.359s
    

    Spider is a wordlist and ngram creation tool: it crawls a given URL or processes a local file to create wordlists and/or ngrams, depending on the flags given.

    Usage Instructions:

    • To create a simple wordlist from a specified URL (saves a deduplicated wordlist to url_spider.txt):
      • spider -url 'https://github.com/cyclone-github'
    • To set a crawl depth of 2 and create ngrams of length 1-5, use the flags "-crawl 2" and "-ngram 1-5" (see the Go sketch after this list)
      • spider -url 'https://github.com/cyclone-github' -crawl 2 -ngram 1-5
    • To set a custom output file, use flag "-o filename"
      • spider -url 'https://github.com/cyclone-github' -o wordlist.txt
    • To set a delay to avoid being rate-limited, use the flag "-delay n", where n is the delay in milliseconds
      • spider -url 'https://github.com/cyclone-github' -delay 100
    • To set a URL timeout, use the flag "-timeout n", where n is the timeout in seconds
      • spider -url 'https://github.com/cyclone-github' -timeout 2
    • To create ngrams of length 1-3 and sort the output by frequency, use "-ngram 1-3" and "-sort"
      • spider -url 'https://github.com/cyclone-github' -ngram 1-3 -sort
    • To filter crawled URLs by keyword "foobar"
      • spider -url 'https://github.com/cyclone-github' -url-match foobar
    • To process a local text file, create ngrams len 1-3 and sort output by frequency
      • spider -file foobar.txt -ngram 1-3 -sort
    • Run spider -help to see a list of all options
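
    As a rough illustration of how the -crawl, -delay, -timeout, and -url-match flags interact, here is a minimal Go sketch of a depth-limited crawler with a politeness delay, a request timeout, and a case-insensitive keyword filter. Everything here (names, the regex link extractor, the BFS structure) is a hypothetical sketch of the general technique, not Spider's actual source.

    ```go
    package main

    import (
    	"fmt"
    	"io"
    	"net/http"
    	"regexp"
    	"strings"
    	"time"
    )

    // hrefRe is a crude link extractor for illustration only; a real
    // crawler would parse HTML properly and restrict which paths it
    // follows.
    var hrefRe = regexp.MustCompile(`href="(https?://[^"]+)"`)

    // crawl fetches start and follows links breadth-first up to depth,
    // sleeping delay between requests and skipping discovered URLs that
    // do not contain match (case-insensitive) -- roughly what -crawl,
    // -delay, -timeout, and -url-match control.
    func crawl(start string, depth int, delay, timeout time.Duration, match string) {
    	client := &http.Client{Timeout: timeout} // like -timeout
    	seen := map[string]bool{start: true}
    	frontier := []string{start}

    	for d := 0; d <= depth && len(frontier) > 0; d++ {
    		var next []string
    		for _, u := range frontier {
    			time.Sleep(delay) // rate-limit politeness, like -delay
    			resp, err := client.Get(u)
    			if err != nil {
    				continue
    			}
    			body, _ := io.ReadAll(resp.Body)
    			resp.Body.Close()
    			fmt.Println("crawled:", u)
    			for _, m := range hrefRe.FindAllStringSubmatch(string(body), -1) {
    				link := m[1]
    				// case-insensitive keyword filter, like -url-match
    				if seen[link] || !strings.Contains(strings.ToLower(link), strings.ToLower(match)) {
    					continue
    				}
    				seen[link] = true
    				next = append(next, link)
    			}
    		}
    		frontier = next
    	}
    }

    func main() {
    	// an empty match string matches every URL
    	crawl("https://example.com", 1, 10*time.Millisecond, time.Second, "")
    }
    ```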

    spider -help

      -crawl int
            Depth of links to crawl (default 1)
      -cyclone
            Display coded message
      -delay int
            Delay in ms between each URL lookup to avoid rate limiting (default 10)
      -file string
            Path to a local file to scrape
      -url-match string
            Only crawl URLs containing this keyword (case-insensitive)
      -ngram string
            Lengths of n-grams (e.g., "1-3" for 1, 2, and 3-length n-grams). (default "1")
      -o string
            Output file for the n-grams
      -sort
            Sort output by frequency
      -timeout int
            Timeout for URL crawling in seconds (default 1)
      -url string
            URL of the website to scrape
      -version
            Display version
    

    Compile from source:

    • If you want the latest features, compiling from source is the best option since the release version may run several revisions behind the source code.
    • This assumes you have Go and Git installed
      • git clone https://github.com/cyclone-github/spider.git # clone repo
      • cd spider # enter project directory
      • go mod init spider # initialize Go module (not needed if go.mod already exists)
      • go mod tidy # download dependencies
      • go build -ldflags="-s -w" . # compile binary in current directory
      • go install -ldflags="-s -w" . # compile binary and install to $GOPATH
    • Compile from source code how-to:
      • https://github.com/cyclone-github/scripts/blob/main/intro_to_go.txt

    Changelog:

    • https://github.com/cyclone-github/spider/blob/main/CHANGELOG.md

    Mentions:

    • Go Package Documentation: https://pkg.go.dev/github.com/cyclone-github/spider
    • Softpedia: https://www.softpedia.com/get/Internet/Other-Internet-Related/Cyclone-s-URL-Spider.shtml

    Sysadmin by day | Hacker by night | Go Developer | hashpwn site owner
    3x RTX 4090

Post #2 by cyclone (Admin), last edited by cyclone

      v0.8.0
      New code pushed to GitHub.

      added flag "-file" to allow creating ngrams from a local plaintext file (ex: foobar.txt)
      added flag "-timeout" for -url mode
      added flag "-sort" which sorts output by frequency
      fixed several small bugs
      

      You can also use the -file and -sort flags to frequency-sort and dedup wordlists that contain duplicates (see the sketch after the list below).
      ex: spider -file foobar.txt -sort
      This optimizes wordlists by sorting them by probability, with the most frequently occurring words listed at the top.

      Keep in mind:

      1. This only applies to wordlists which contain dups
      2. Sorting large wordlists is RAM intensive
      3. This feature is beta, so results may vary
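
      For the curious, here is a minimal Go sketch of the frequency-sort-and-dedup idea (illustrative only, not Spider's actual code): count duplicate lines, then emit each unique line once, most frequent first.

      ```go
      package main

      import (
      	"bufio"
      	"fmt"
      	"os"
      	"sort"
      )

      func main() {
      	// count occurrences of each line read from stdin
      	counts := make(map[string]int)
      	scanner := bufio.NewScanner(os.Stdin)
      	for scanner.Scan() {
      		counts[scanner.Text()]++
      	}

      	// collect unique lines and sort by descending frequency
      	words := make([]string, 0, len(counts))
      	for w := range counts {
      		words = append(words, w)
      	}
      	sort.Slice(words, func(i, j int) bool {
      		return counts[words[i]] > counts[words[j]] // most frequent first
      	})

      	// each unique line is printed exactly once (dedup)
      	for _, w := range words {
      		fmt.Println(w)
      	}
      }
      ```

      Usage, assuming the sketch is saved as sortdedup.go: go run sortdedup.go < foobar.txt > sorted.txt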

Post #3 by cyclone (Admin)

        New Release: v0.9.0:
        https://github.com/cyclone-github/spider/releases/tag/v0.9.0

        Changelog:

        • v0.9.0 by cyclone-github in https://github.com/cyclone-github/spider/pull/7
          • added flag "-url-match" to only crawl URLs containing a specified keyword; https://github.com/cyclone-github/spider/issues/6
          • added notice to user if no URLs are crawled when using "-crawl 1 -url-match"
          • exit early if zero URLs were crawled (no processing or file output)
          • use custom User-Agent "Spider/0.9.0 (+https://github.com/cyclone-github/spider)" (see the sketch after this list)
          • removed clearScreen function and its imports
          • fixed crawl-depth calculation logic
          • restricted link collection to .html, .htm, .txt and extension-less paths
          • upgraded dependencies and bumped Go version to v1.24.3
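
        As a footnote on the User-Agent item above, here is a minimal Go sketch of sending a custom User-Agent header with a request. Only the header string itself comes from the changelog; the surrounding request code is an assumption, not Spider's source.

        ```go
        package main

        import (
        	"fmt"
        	"log"
        	"net/http"
        )

        func main() {
        	req, err := http.NewRequest("GET", "https://example.com", nil)
        	if err != nil {
        		log.Fatal(err)
        	}
        	// the User-Agent string quoted in the v0.9.0 changelog
        	req.Header.Set("User-Agent", "Spider/0.9.0 (+https://github.com/cyclone-github/spider)")

        	resp, err := http.DefaultClient.Do(req)
        	if err != nil {
        		log.Fatal(err)
        	}
        	defer resp.Body.Close()
        	fmt.Println("status:", resp.Status)
        }
        ```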

