Eight fundamental techniques for automating the gathering of data from business websites

Today, we’ll discuss simple methods for automating the collection of contact details, such as emails, phone numbers, cryptocurrency wallet addresses, social network links, and documents, from target company websites. In short, anything that can serve as a starting point for an investigation.

Although we concentrate on corporate intelligence, the majority of the methods discussed below can also be used to obtain information about websites connected to persons of interest.

You will discover how to use the following tools in this article:

  • Nuclei
  • MetaDetective
  • WayBackUrls
  • WayBackMachine Downloader
  • Netlas

You can use Gitpod to run the following examples if you don’t have Python, Go, and Ruby installed on your PC or are unsure if you do.

1 Using Netlas to obtain a list of subdomains

You can skip the first two steps if you are certain that the company only has one website.

There are several ways to find websites linked to your company. The first is to look for subdomains.

Open Netlas Response Search in your browser and type the following into the search field:

host:*.lidl.com

Next, click the icon on the left to download the results, choose CSV format, input the file name and the total number of results, and choose the columns that interest you (host is required, other fields are optional).

To remove duplicates (in the IP field), import the table into Google Sheets and select Data -> Data cleanup -> Remove duplicates. Numbers, Excel, and other analogues will also work.
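
If you prefer to stay on the command line, a small Python sketch along these lines can do the same deduplication; the export file name netlas_export.csv is an assumption made for this example, and the script expects the host column selected during export.

# dedupe_hosts.py - a small sketch: pull unique hosts out of the Netlas CSV export.
# Assumes the export was saved as netlas_export.csv and contains a "host" column;
# both names are assumptions made for this example.
import csv

hosts = set()
with open("netlas_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        host = (row.get("host") or "").strip()
        if host:
            hosts.add(host)

# Write the unique hosts to domains.txt, which is used again in step 3.
with open("domains.txt", "w") as f:
    f.write("\n".join(sorted(hosts)))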

By using the Netlas Python Library, you can also automate the process of searching for subdomains. For further information about this, see Netlas CookBook.
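
For example, a minimal sketch of such a script could look like this. It assumes the library is installed with pip install netlas and that a valid API key is available in the NETLAS_API_KEY environment variable; the exact response structure may differ slightly between library versions, so treat it as a starting point rather than a ready-made tool.

# subdomains.py - a minimal sketch of automating the subdomain search
# with the Netlas Python Library.
import os
import netlas

conn = netlas.Netlas(api_key=os.environ["NETLAS_API_KEY"])

# The same query as in the web interface (Response Search index)
results = conn.query(query="host:*.lidl.com")

hosts = set()
for item in results.get("items", []):
    host = item.get("data", {}).get("host")
    if host:
        hosts.add(host)

# Note: query() returns a single page of results; see the Netlas CookBook
# for paging through or downloading the full result set.
with open("domains.txt", "w") as f:
    f.write("\n".join(sorted(hosts)))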

2 Further ways to use Netlas to gather websites linked to your organization

Using different search queries, you can also look for possibly related websites. Here are a few examples.

Search by organization name in Domain Whois Netlas search:

“GitHub, Inc.”

Search for sites that are served by company-owned mail servers in DNS Netlas search:

mx:*.parklogic.com

Search for sites that are served by company-owned name servers in DNS Netlas search:

ns:ns?.parklogic.com

Additional methods for locating possibly linked websites include searching by Favicon, Google Analytics ID, emails in Whois contacts, and contact details in SSL certificates. More information on them can be found in Netlas CookBook.

Note that each example uses a different company’s website; this is because it is hard to find a single company for which all of these techniques work at the same time.

In a similar vein, you can look up linked websites using several IP search engines, such as ZoomEye, Censys, Fofa, and Shodan.

So far, all we have is a list of addresses. Now we need to collect as many links as possible to pages that might contain material relevant to the investigation.

3 Using WayBackUrls to obtain a list of site URLs

First, let’s gather all the URLs saved in archive.org that are accessible through the Archive.org CDX API.

Add a list of domains to the domains.txt file (from the domain column of the CSV Netlas output).

Install WayBackUrls:

go install github.com/tomnomnom/waybackurls@latest

Run WayBackUrls:

cat domains.txt | waybackurls > wayback_urls.txt
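
By the way, if you would rather avoid installing Go tools, the same Archive.org CDX API can be queried directly with a few lines of Python. The sketch below does roughly what WayBackUrls does and assumes the requests library is installed.

# cdx_urls.py - a rough sketch of querying the archive.org CDX API directly.
import requests

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

with open("domains.txt") as f:
    domains = [line.strip() for line in f if line.strip()]

with open("wayback_urls.txt", "w") as out:
    for domain in domains:
        params = {
            "url": f"*.{domain}/*",   # the domain and all of its subdomains
            "fl": "original",         # return only the original URL column
            "collapse": "urlkey",     # skip duplicate URLs
        }
        resp = requests.get(CDX_ENDPOINT, params=params, timeout=120)
        out.write(resp.text)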

4 Other techniques to obtain a list of website URLs

Unfortunately, archive.org is missing many pages and even whole sites, so it is often impossible to identify all the URLs available on a company’s website in this manner. There are a few other methods, such as collecting links from search engine results (for example, DuckDuckGo) and brute-forcing paths with GoBuster; below, their output is assumed to be saved as duckduckgo_urls.txt and gobuster_urls.txt.

Once you’ve gathered links from various sources, make sure the final list contains only unique links.

Combine all files (remember to include domains.txt, which was gathered in step #1):

cat wayback_urls.txt duckduckgo_urls.txt gobuster_urls.txt domains.txt > merged_urls.txt

Remove the duplicate lines from merged_urls.txt:

sort merged_urls.txt | uniq > urls.txt

Let’s now try to automatically analyze these URLs and extract the data that will be most useful for our investigation.

5 Use Juicy Info Nuclei Templates to extract contact information

Nuclei was originally created to scan web pages for various vulnerabilities, and it is one of the fastest web scanners in the world. However, it can also use regular expression patterns to extract various data from websites.

Install Nuclei:

go install -v github.com/projectdiscovery/nuclei/v3/cmd/nuclei@latest

Download “Juicyinfo” Nuclei Templates:

git clone https://github.com/cipher387/juicyinfo-nuclei-templates

Using the list of URLs, let’s try to extract emails from the HTML code of the web pages. The template file is assumed here to be juicy_info/email.yaml (check the repository for the exact name):

nuclei -t juicyinfo-nuclei-templates/juicy_info/email.yaml -l urls.txt

To save time while preparing this example, I included only 200 URLs in the list, which is why only one email was found. When the number of pages runs into the tens of thousands, there are usually far more.

Juicy Info Nuclei Templates can also extract:

  • social media links: facebook.yaml, github.yaml, gravatar.yaml, telegram.yaml, twitter.yaml, youtube.yaml, linkedin.yaml;
  • potential nicknames/handles: nickname.yaml;
  • potential telephone numbers: phonenumber.yaml;
  • any links: urls.yaml;
  • IP addresses: ipv4.yaml;
  • image links: images.yaml;
  • cryptocurrency wallet addresses: bitcoin_address.yaml and the cryptocurrency-juicy_info folder.

If you want to use several templates at once, separate their paths with commas, for example -t juicyinfo-nuclei-templates/juicy_info/phonenumber.yaml,juicyinfo-nuclei-templates/juicy_info/nickname.yaml.

6 Extract and download document links using Juicy Info Nuclei Templates

Searching MS Office and PDF documents deserves a separate section, as these files frequently contain important company information.

To locate links to documents, use the office_documents.yaml and pdf.yaml templates:

nuclei -t juicyinfo-nuclei-templates/juicy_info/pdf.yaml -l urls.txt

Unfortunately, document links in HTML code come in a variety of formats: localhost/file.pdf, //file.pdf, downloads/file.pdf, file.pdf, and so on. It is hard to explain briefly how to automate the cleanup of such a list (hint: use regular expressions).

To begin with, let’s manually convert the URLs found by Nuclei to the https://targetsite.com/file.pdf (xlsx, docx, etc.) format and save them to files_urls.txt.
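
If you would rather script this normalization, a rough Python sketch along these lines handles the most common cases. The input file name raw_file_links.txt is an assumption made for illustration, and the sketch assumes all links belong to one known site.

# normalize_links.py - a rough sketch for normalizing document links, not a ready-made tool.
from urllib.parse import urljoin

BASE_URL = "https://targetsite.com/"

with open("raw_file_links.txt") as f:
    raw_links = [line.strip() for line in f if line.strip()]

normalized = set()
for link in raw_links:
    if link.startswith("//"):                       # protocol-relative: //file.pdf
        normalized.add("https:" + link)
    elif link.startswith(("http://", "https://")):  # already absolute
        normalized.add(link)
    else:                                           # relative: downloads/file.pdf, file.pdf
        normalized.add(urljoin(BASE_URL, link))

with open("files_urls.txt", "w") as f:
    f.write("\n".join(sorted(normalized)))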

And download them using curl:

xargs -a files_urls.txt -I{} curl -# -O {}

For this example, I filled the file with links to documents from several different websites, so that the next section can demonstrate how the MetaDetective tool works.

7 Use MetaDetective to extract possibly helpful information

MetaDetective is a simple Python script that analyzes file metadata and gathers the most important data from it.

Install MetaDetective:

pip install MetaDetective

Install exiftool:

sudo apt install libimage-exiftool-perl

Run MetaDetective:

MetaDetective -d /workspace/company_information_gathering_automation/

This will show the software that was used, along with the names of the users who worked on the documents.

You can also use the tool to scrape URLs from a website’s HTML code and analyze the metadata of individual files, including those hosted on other servers. In other words, you don’t have to download the files (unless you need them for something else); you can simply point MetaDetective at a URL:

python3 src/MetaDetective/MetaDetective.py --scraping --scan --url https://example.com/

8 Use Netlas to retrieve contacts and document links from websites that are no longer accessible

If your objective is to gather as much company information as possible, it is also worth looking for contact details on deleted pages and expired domains. This can be accomplished in at least two ways.

The first is to download previous versions of the pages from archive.org.

Install wayback_machine_downloader:

gem install wayback_machine_downloader

And run it:

wayback_machine_downloader http://sector035.nl

Now Nuclei can be used to scan the local files:

nuclei -u /workspace/company_information_gathering_automation/websites  -t juicyinfo-nuclei-templates/juicy_info/url.yaml

The second method is to download web page bodies from Netlas Search Results. Simply choose the http -> body field in the export options.

After that, you can use any regular expression tool, such as grep or the Python re package, or a Python scraping module like Beautiful Soup, to extract data from this file. For further information, see the Netlas CookBook Scraping section.
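
For example, a quick sketch with the re package could pull email-like strings out of the exported file; the file name netlas_bodies.txt and the email pattern are assumptions made for illustration.

# extract_emails.py - a quick sketch: find email-like strings in exported page bodies.
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

with open("netlas_bodies.txt", encoding="utf-8", errors="ignore") as f:
    text = f.read()

for email in sorted(set(EMAIL_RE.findall(text))):
    print(email)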

That’s all for today. These were the most basic ways to learn more about a business. Maybe I’ll go into more detail in my upcoming posts about using AI to automate document analysis and GoBuster to locate hidden files on websites, among many more OSINT methods.
