How to Find All Pages on a Website – 8 Easy Ways

Author: Tatiana Tsyulia

Date: Apr 18, 2023

How do you find all pages existing on a website? The first idea that comes to mind is to google the site’s domain name.

But what about URLs that fail to get indexed? Or orphan pages? Or web cache?

Finding all the pages on a website is pretty easy; however, it requires some extra attention considering there are pages that are hidden from the eyes of visitors or search bots. This guide shows 8 different methods of finding all the site's pages along with the tools to use.

Why you may need to find all pages on a website
1. Look it up with Google search operators
2. Check the robots.txt file
3. Examine the sitemap
4. Crawl with an SEO spider
5. Check your pages in Search Console
6. Use Google Analytics
7. Analyze logs
8. Work with your CMS

Why you may need to find all pages on a website

There are tons of reasons why you may need to find all pages on a website. To name a few:

1. To audit a new client’s website and find indexing issues.

Technical issues such as broken links, server errors, slow page speed, or bad mobile usability prevent Google from indexing the pages. So, site audits reveal how many URLs a site has and which of them are problematic. In the end, it helps SEOs estimate the scope of future work in the project.

2. To detect your own site’s pages that are not indexed by mistake.

If your website has duplicate content, then Google may fail to index all of the duplicates. The same concerns long redirect chains and 404 URLs: if there are many of them on a site, the crawl budget is spent in vain. As a result, the search bots visit the site less often, and it will be indexed worse overall. That is why regular audits are needed even if something looks normal in general.

3. To spot indexed pages that are not meant for Google indexing.

Some pages are not needed in the search index – for example, login pages for admins, pages in development, or shopping carts. Still, these pages might be indexed against your will because of conflicting rules or errors in your technical files. For example, if you rely solely on robots.txt to disallow a page, the URL will not be crawled, but it may still get indexed and appear in search if other pages link to it.

4. To find outdated pages and plan a complete content overhaul.

Google aims to provide the best possible results for its users, so if your content is of poor quality, thin or duplicate, then it may fail to get indexed. It is good to have a list of all your pages to know what topics you have not covered yet. With all your content inventory at hand, you will be able to plan your content strategy more effectively.

5. To find orphan pages and plan linking strategies.

Orphans are pages without incoming links, so users and search bots visit them rarely or not at all. Orphan pages may still get indexed in Google and draw accidental visitors. However, a large number of orphan pages hurts a website's authority: the site structure becomes unclear, the pages may look unhelpful or unimportant, and all the deadwood drags down the total visibility of the website.

6. To redesign a website and change its architecture.

To plan a website redesign and improve the user experience, you will first need to find all its pages and relevant metrics.

A clear and organized structure with a logical hierarchy of all pages helps search engines find your content more easily. So, all important URLs should be reachable within three clicks of the homepage.

Although user experience does not directly affect crawling, it matters to the quality signals of your website – successful purchases, the number of returning visitors, pageviews per visitor, and many other metrics show how useful your website is to visitors.

7. To analyze competitors’ websites.

By auditing your competitors’ pages, you can dig deeper into their SEO strategies: reveal their top traffic pages, the most linked-to pages, the best referral sources, etc. This way, you can get valuable insights and learn what works well for your competitors. You can borrow their techniques and compare results to see how to improve your own website.

There are many ways to find all pages on a website, and different cases call for different methods. So, let’s look at the pros and cons of each method and how to use it with no fuss.

1. Look it up with Google search operators

Google search can quickly help find all the pages of a website. Simply enter site:yourdomain.com (with no space after the colon) into the search bar, and Google will show you the pages of the website that it has indexed.

The results of a site: search show the URLs that Google has found on your site

However, it's important to remember that the search results shown by the “site:” operator do not necessarily reflect the precise number of your site’s indexed pages.

First, there is no guarantee that Google will index every page straight after it has crawled it. It may exclude certain pages from the index for various reasons: for example, it considers some pages as duplicates or of low quality.

Second, the “site:” search operator may also show pages that have been removed from your website, but they are kept as cached or archived pages on Google.

Therefore, the “site:” search query is a good start to get an approximate picture of how large your site is. But to find the rest of the pages that might be missing from the index, you will need some other tools.

2. Check the robots.txt file

Robots.txt is a technical file that tells search bots how to crawl your website, using allow/disallow rules for individual pages or whole directories.

The file will not show you all the pages on your site. However, it can help you locate pages that search bots are banned from accessing.

How-to

Here are the steps on how to find the restricted pages using robots.txt:

1. Find the robots.txt file on the website. It is usually located in the root directory, so you can type in example.com/robots.txt, and there it will be.
2. Open the file in a text editor or browser.
3. Look at the “User-agent” line, which specifies the search engine crawler to which the following rules apply.
4. Look for the “Disallow” rules. These lines specify the pages or directories that the search engine crawler is not allowed to access.
5. If you’ve found any, examine the URLs and directories that are blocked.
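
If you would rather not read the file by eye, the steps above can be sketched in a few lines of Python. This is only a minimal illustration: the function name disallowed_paths and the sample rules are made up for the example, and the parsing is deliberately simplified (no wildcard matching, no Allow precedence), so treat it as a starting point and not as a full robots.txt parser.

```python
from typing import Dict, List


def disallowed_paths(robots_txt: str) -> Dict[str, List[str]]:
    """Group non-empty Disallow values by the User-agent they apply to."""
    rules: Dict[str, List[str]] = {}
    current_agents: List[str] = []
    previous_was_agent = False
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and padding
        if ":" not in line:
            continue  # blank or malformed line
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            if not previous_was_agent:
                current_agents = []  # a new group of rules starts here
            current_agents.append(value)
            rules.setdefault(value, [])
            previous_was_agent = True
            continue
        previous_was_agent = False
        if field.lower() == "disallow" and value:
            for agent in current_agents:
                rules[agent].append(value)
    return rules


# Hypothetical file contents, used only to demonstrate the function.
SAMPLE = """\
User-agent: *
Disallow: /signup
Disallow: /cart/   # checkout pages
Allow: /

User-agent: ExampleBot
Disallow: /
Sitemap: https://example.com/sitemap.xml
"""

if __name__ == "__main__":
    for agent, paths in disallowed_paths(SAMPLE).items():
        print(agent, paths)
```

In practice you would download the real file from example.com/robots.txt first and pass its text to the function.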

Here is an example of robots directives for YouTube.

Robots directives for YouTube website

Check how it works. For example, the sign-up page is disallowed. However, you still can get it when searching on Google – notice that no descriptive information is available for the page.

A page disallowed by robots directives shows up in search results

It is necessary to recheck your robots.txt rules to make sure that all of your pages are crawled properly. So, you might need a tool such as Google Search Console or a site crawler to review it. I’ll dwell on it in a moment.

In the meantime, if you want to learn more about the purpose of the file, read this guide to hiding web pages from indexing.

3. Examine the sitemap

A sitemap is another technical file that webmasters use for proper site indexing. This document, often in XML format, lists all the URLs on a website that should be indexed. A sitemap is a valuable source of information about a website’s structure and content.

Large websites may have several sitemaps: since a single file is limited to 50,000 URLs and 50 MB (uncompressed), the sitemap can be split into several files, with separate sitemaps for directories, images, videos, etc. E-commerce platforms like Shopify or Wix generate sitemaps automatically. For others, there are plugins or sitemap generator tools to create the files.
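
To make the 50,000-URL limit more tangible, here is a rough sketch of how a long list of URLs could be split into sitemap-sized batches. The function name and the example.com URLs are invented for illustration, and the sketch only handles the URL count; it does not check the 50 MB size limit.

```python
from typing import Iterator, List

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit mentioned above


def split_into_sitemaps(
    urls: List[str], limit: int = MAX_URLS_PER_SITEMAP
) -> Iterator[List[str]]:
    """Yield consecutive batches of URLs, each small enough for one sitemap file."""
    for start in range(0, len(urls), limit):
        yield urls[start:start + limit]


if __name__ == "__main__":
    # Hypothetical site with 120,000 product pages.
    all_urls = [f"https://example.com/product/{i}" for i in range(120_000)]
    for number, batch in enumerate(split_into_sitemaps(all_urls), start=1):
        print(f"sitemap-{number}.xml would hold {len(batch)} URLs")
```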

How-to

A website's sitemap lets you easily find the pages listed on it and check whether they are indexed:

1. Look for a link to the sitemap in the footer or header of the website. The sitemap is usually located at yourdomain.com/sitemap.xml or a similar URL. You can also check the robots.txt file, because it is the most common place to include a reference to the sitemap.
2. Open the sitemap in a text editor or XML viewer.
3. Look at the <loc> tags in the sitemap file. These tags contain the URL of each page on the website.
4. Copy the URLs from the <loc> tags into a spreadsheet or text document.
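
Copying URLs out of <loc> tags by hand gets tedious on bigger sites, so here is a small sketch that does it with Python's standard XML parser. The sample sitemap and the function name extract_locs are assumptions made for the example. Note that the same function also returns the <loc> entries of a sitemap index file, in which case each result is the address of another sitemap that you would process in turn.

```python
import xml.etree.ElementTree as ET
from typing import List


def extract_locs(sitemap_xml: str) -> List[str]:
    """Return the text of every <loc> element in a sitemap or sitemap index."""
    root = ET.fromstring(sitemap_xml)
    # Sitemap files declare an XML namespace, so compare only the local tag name.
    return [
        element.text.strip()
        for element in root.iter()
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text
    ]


# Hypothetical sitemap content, used only to demonstrate the function.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/</loc><lastmod>2023-04-18</lastmod></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_locs(SAMPLE):
        print(url)
```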

An example of several sitemaps listing all pages on a website

You should also recheck the correctness of your sitemap once in a while, as it may have issues too: it might be blank, responding with a 404 code, cached long ago, or it may simply contain the wrong URLs that you don’t want to appear in the index.

A good method to validate your sitemap is to use a website crawling tool. There are several website crawler tools available online, and one of them is WebSite Auditor, a powerful SEO tool for sitewide audits. Let’s see how it can help you find all the pages on a website and validate technical files.

4. Crawl with an SEO spider

How-to

Here is how you can use WebSite Auditor to find all the pages on your website:

1. Launch WebSite Auditor and type in the URL of your website to create a new project.
2. Check the Advanced settings box and complete the setup, indicating the exact crawl parameters. (If you don’t know yet what to look for, skip the advanced setup and let the SEO spider crawl your site with default settings.)
3. In the Advanced settings, you have several options to make sure that the website crawler finds all pages. For example, tick the Search for orphan pages option, and it will collect all the URLs without incoming links.

You can specify the instructions for a certain search bot or user agent; tell the crawler to ignore URL parameters, crawl a password-protected site, crawl a domain alone or together with subdomains, etc.

Setting up the web crawler to find all pages, including those unlinked from any other pages

After you click OK, the tool will audit your site and collect all the pages in the Site Structure > Pages section.
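
The general idea behind an orphan-page search can be shown without any particular tool. The sketch below is not how WebSite Auditor works internally; it simply assumes you already have a set of known URLs (say, from the sitemap, analytics, or logs) and a map of internal links from a crawl, and it reports the known URLs that no other page links to. All names and paths here are hypothetical.

```python
from typing import Dict, Set


def find_orphans(
    known_urls: Set[str], links: Dict[str, Set[str]], homepage: str
) -> Set[str]:
    """Return known URLs that receive no internal link from another page."""
    linked: Set[str] = set()
    for source, targets in links.items():
        # A page linking to itself does not count as an incoming link.
        linked.update(target for target in targets if target != source)
    # The homepage is the entry point, so it is never treated as an orphan.
    return known_urls - linked - {homepage}


if __name__ == "__main__":
    # Hypothetical data: URLs known from a sitemap and links found by a crawl.
    known = {"/", "/blog/", "/blog/post-1", "/old-landing"}
    crawl_links = {
        "/": {"/blog/"},
        "/blog/": {"/blog/post-1", "/"},
        "/old-landing": {"/"},
    }
    print(find_orphans(known, crawl_links, homepage="/"))
```

Here /old-landing links out to the homepage, but nothing links to it, so it is the only orphan in the sample.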

WebSite Auditor will help you recheck if the URLs are properly optimized for search engines. You will get to know the tool in a few minutes, as the setup is quick, and the interface is pretty intuitive.

Let’s see what you can get from the website crawling tool.

Collect the list of pages with all their resources

In the All pages tab, you can sort the list by URL, title, or any other column by clicking on the column header.

Get the list of all pages with all resources on them in Site structure > Pages section

You can use the search box to filter the list of pages by keyword or page URL. This can be helpful if you're looking for a specific page or group of pages.

Besides, you can add visible columns to present more information about each page, such as meta tags, headings, keywords, redirects, or any other on-page SEO element.

Finally, you can click on any URL to examine all resources on the page in the lower half of the workspace.

All the data can be handled inside the tool, copied, or exported in CSV or Excel format.

Get lists of pages affected by technical errors

The Site Audit section will show you lists of pages split by types of errors, such as:

Duplicate issues
Faulty redirects and redirect chains
Pages restricted from indexing
Broken resources

Find all site's pages listed by their type of errors

Under each type of issue, you will see an explanation of why this factor is important and a few suggestions on how to fix it.

See the visualized site structure

Besides, you can examine your visual sitemap in Site Structure > Visualization, which shows the relations between all your URLs. The interactive map allows you to add or remove pages and links to adjust your site structure. You can recalculate the Internal PageRank value and check the Pageviews (as tracked by your Google Analytics).

See all site's pages in a visual sitemap

Use generator tools to validate technical files

On top of that, WebSite Auditor also checks the availability of both your robots.txt file and the sitemap.

It lets you edit the technical files in the Website tools and upload them straight to your site with the proper settings.

Creating a sitemap in WebSite Auditor

You will not need to observe any special syntax when editing the files – just select the required URLs and apply the necessary rules. Then, click to generate the files and save them to your computer or upload to the site via FTP.

Editing robots directives in WebSite Auditor

5. Check your pages in Search Console

One more great tool to discover all your site’s pages is Google Search Console. It will help you check the pages’ indexing and reveal the issues that prevent search bots from correctly indexing these URLs.

How-to

You can get a breakdown of all your pages by their indexing status, including those pages that have not been indexed yet.

Here is how to find all your site’s pages with Search Console:

1. Go to the Indexing report and click View data about indexed pages. You will see all the pages that the search bot last crawled on your website. However, mind that the table is limited to 1,000 URLs. There is a quick filter to switch between all known pages, all submitted URLs, etc.

All indexed pages in Search Console

2. Enable the Not indexed tab. Below, the tool gives you the details on why each URL is not indexed.

All site's pages that Google has not indexed yet

Click on each reason and see the URLs affected by the issue.

The difficulty is that you will get not only the main URLs of your pages, but also anchor links, pagination pages, URL parameters, and other garbage that requires manual sorting. And the list might be incomplete because of the 1,000 entries limit in the table.
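
Some of that manual sorting can be automated. Below is a minimal sketch that strips anchors and a few query parameters from exported URLs and removes the resulting duplicates. The list of parameters to drop is purely an assumption for the example (here it treats pagination and tracking parameters as noise); adjust it to whatever your own site uses.

```python
from typing import Iterable, List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that do not lead to a distinct page on this site.
DROP_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "page", "sort"}


def normalize(url: str) -> str:
    """Remove the fragment and unwanted query parameters from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in DROP_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )


def unique_pages(urls: Iterable[str]) -> List[str]:
    """Return the sorted set of normalized URLs."""
    return sorted({normalize(url) for url in urls})


if __name__ == "__main__":
    # Hypothetical rows from an exported report.
    exported = [
        "https://example.com/blog/#comments",
        "https://example.com/blog/?page=2",
        "https://example.com/blog/?utm_source=newsletter",
        "https://Example.com/shop?color=red",
        "https://example.com/shop?color=red&sort=price",
    ]
    for page in unique_pages(exported):
        print(page)
```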

Also, keep in mind that other search engines may have different indexing rules, and you need to use their webmaster tools to find and handle such issues. For example, use Bing Webmaster Tools, Yandex Webmaster, Naver Webmaster, and others to check indexing in the respective search engines.

6. Use Google Analytics

Google Analytics is one of the most widely used analytics platforms, so almost any website owner or editor is familiar with it. The good old Universal Analytics is soon going to be replaced by Google Analytics 4. So, let’s look at both versions of the tool.

How-to

To collect your site’s pages in Google's Universal Analytics, follow these steps:

1. In your Google Analytics account, select the website you want to explore.
2. Go to the Behavior module in the left-hand sidebar.
3. Select Site Content > All Pages tab. You should now see a list of all the pages on your website that have been tracked by Google Analytics.

Seeing all your pages in Universal Analytics

You will see the pages with their user behavior stats, such as pageviews, bounce rate, average time on page, etc. Pay attention to the pages with the fewest pageviews over all time – they are probably orphan pages.

To recreate a similar flow in Google Analytics 4:

1. Go to Reports > Engagement module.
2. Select the Pages and screens section.
3. Change the dimension from Page title and screen class to Page path and screen class. You should now see a table showing all URLs on your website that have been tracked by Google Analytics 4.

Finding all your website pages in Google Analytics 4

Just like with Search Console, the list will include URL parameters and the like. You can export the list of pages as a CSV or an Excel sheet by clicking on the Export button at the top of the page.

7. Analyze logs

Some websites are really huge, and even powerful SEO spiders may have a hard time crawling all of their pages. Log analysis is a good option for finding and examining all pages on large websites.

By analyzing your website's log file, you can identify all the pages that get visitors from the web, their HTTP responses, how frequently crawlers visit the pages, and so forth.

Log files are stored on your server, and you’ll need the required level of access to retrieve them, plus a log analyzer tool. So, this method is more suitable for tech-savvy people, webmasters, or developers.

How-to

Here are the steps to find all your site's pages using log analysis:

1. Download your website's server logs and open them with the log analysis tool of your choice.
2. Filter the log data by HTTP status code. It will help you identify all the pages on your website that have drawn some visitors.
3. Look for log entries with a 200 status code, which indicates that the page was successfully accessed. You can also filter by other status codes to find pages that have been redirected, such as 301 or 302 redirects.
4. Just like with other tools, you can export the list of pages to a spreadsheet or another format for further analysis.
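
If you don't have a log analyzer at hand, the filtering step can be sketched in Python. The example assumes your server writes logs in the common/combined access log format (the default for many Apache and nginx setups); the sample lines, IP addresses, and the function name are invented for illustration, and other log formats would need a different pattern.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Matches the quoted request line and the status code that follows it,
# e.g. "GET /blog/ HTTP/1.1" 200
REQUEST = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})\b')


def paths_by_status(
    log_lines: Iterable[str], wanted: Tuple[str, ...] = ("200",)
) -> Counter:
    """Count requests per path, keeping only the wanted HTTP status codes."""
    hits: Counter = Counter()
    for line in log_lines:
        match = REQUEST.search(line)
        if match and match.group("status") in wanted:
            hits[match.group("path")] += 1
    return hits


# Hypothetical log entries, used only to demonstrate the function.
SAMPLE_LOG = [
    '203.0.113.5 - - [18/Apr/2023:10:15:32 +0000] "GET /blog/ HTTP/1.1" 200 5123 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [18/Apr/2023:10:16:01 +0000] "GET /blog/ HTTP/1.1" 200 5123 "-" "Googlebot/2.1"',
    '203.0.113.5 - - [18/Apr/2023:10:16:40 +0000] "GET /old-page HTTP/1.1" 301 0 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [18/Apr/2023:10:17:02 +0000] "GET /missing HTTP/1.1" 404 153 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    print("Successfully served:", dict(paths_by_status(SAMPLE_LOG)))
    print("Redirected:", dict(paths_by_status(SAMPLE_LOG, wanted=("301", "302"))))
```

With a real log you would pass the open file object instead of the sample list, since iterating over a file yields its lines one by one.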

8. Work with your CMS

Another way to find all pages on a website is to refer to your Content Management System (CMS), as it will contain all the URLs you have ever created on the website. Examples of CMSs are WordPress and Squarespace, which provide website building tools for content editing in different domains – news and blogging, e-commerce, corporate sites, and the like.

How-to

Although CMSs differ quite a lot in appearance, the general steps apply to most of them:

1. Log into your CMS dashboard and navigate to the page or post section.
2. Look for a list of all pages or posts on your website – in a sidebar, submenu, or separate page.
3. Click on the All Pages or All Posts link to view a list of all the pages on your website.

Mind that there can be categories, blog posts, or landing pages, which are different types of pages that may belong to different sections in the CMS.

Finding all your site's pages in WordPress CMS

Most CMSs allow sorting the URLs by the date of their creation, author, category, or some other criteria. You can also use the search box to filter the list of pages by keywords or titles.

Summary

To find all pages on a website, there is a great array of methods and tools. The one you choose depends on the purpose and the scope of work to do.

I hope you’ve found this list helpful and will now be able to easily collect all your site’s pages even if you’re new to SEO.

If you have a question not answered yet, feel free to ask in our user group on Facebook.
