How do you find all pages existing on a website? The first idea that comes to mind is to google the site’s domain name.
But what about URLs that fail to get indexed? Or orphan pages? Or web cache?
Finding all the pages on a website is pretty easy; however, it requires some extra attention considering there are pages that are hidden from the eyes of visitors or search bots. This guide shows 8 different methods of finding all the site's pages along with the tools to use.
There are tons of reasons why you may need to find all pages on a website. To name a few:
1. To audit a new client’s website and find indexing issues.
Technical issues such as broken links, server errors, slow page speed, or bad mobile usability prevent Google from indexing the pages. So, site audits reveal how many URLs a site has and which of them are problematic. In the end, it helps SEOs estimate the scope of future work in the project.
2. To detect your own site’s pages that are not indexed by mistake.
If your website has duplicate content, then Google may fail to index all of the duplicates. The same concerns long redirect chains and 404 URLs: if there are many of them on a site, the crawl budget is spent in vain. As a result, the search bots visit the site less often, and it will be indexed worse overall. That is why regular audits are needed even if something looks normal in general.
3. To spot indexed pages that are not meant for Google indexing.
Some pages are not needed in the search index – for example, login pages for admins, pages in development, or shopping carts. Still, these pages might be indexed against your will because of conflicting rules or errors in your technical files. For example, if you rely solely on robots.txt to disallow a page, the URL still may get crawled and appear in search.
4. To find outdated pages and plan a complete content overhaul.
Google aims to provide the best possible results for its users, so if your content is of poor quality, thin or duplicate, then it may fail to get indexed. It is good to have a list of all your pages to know what topics you have not covered yet. With all your content inventory at hand, you will be able to plan your content strategy more effectively.
5. To find orphan pages and plan linking strategies.
Orphans are pages without incoming links, because of which users and search bots visit them rarely or do not visit them at all. Orphan pages may get indexed in Google and draw accidental users. However, a big number of orphan pages on a website spoils its authority: the site structure is not crystal clear, the pages may look unhelpful or unimportant, and all the deadwood will drag down the total visibility of the website.
6. To redesign a website and change its architecture.
To plan a website redesign and improve the user experience, you will first need to find all its pages and relevant metrics.
A clear and organized structure with a logical hierarchy of all pages can help search engines find your content easier. So, all important URLs must be reachable within one, two, or three clicks away from the homepage.
Although user experience does not affect crawling and ranking, it matters to the quality signals of your website – successful purchases, the number of returning visitors, pageviews per visitor, and tons more other metrics show how much your website is useful for the visitors.
7. To analyze competitors’ websites.
By auditing your competitors’ pages, you can dig deeper into their SEO strategies: reveal their top traffic pages, the most linked to pages, the best referral sources, etc. This way, you can get valuable insights and learn works well for your competitors. You can borrow their techniques and compare results to see how to improve your own website.
There are many ways to find all pages on a website, but for each case, you can use a different method to do that. So, let’s see the pros and cons of each method and how to employ it with no fuss.
Google search can quickly help find all the pages of a website. Simply enter the "site: your domain" into the search bar, and Google will show you all the pages of the website that it has indexed.
However, it's important to remember that the search results shown by the “site:” operator do not necessarily reflect the precise number of your site’s indexed pages.
First, there is no guarantee that Google will index every page straight after it has crawled it. It may exclude certain pages from the index for various reasons: for example, it considers some pages as duplicates or of low quality.
Second, the “site:” search operator may also show pages that have been removed from your website, but they are kept as cached or archived pages on Google.
Therefore, the “site:” search query is a good start to get an approximate picture of how large your site is. But to find the rest of the pages that might be missing from the index, you will need some other tools.
Robots.txt is a technical file that instructs search bots about how to crawl your website, with the help of the allow/disallow rules for individual pages or whole directories.
Thus, the file will not show you all the pages on your site. However, it can help you locate pages that are banned from being accessed by search bots.
Here are the steps on how to find the restricted pages using robots.txt:
Here is an example of robots directives for YouTube.
Check how it works. For example, the sign-up page is disallowed. However, you still can get it when searching on Google – notice that no descriptive information is available for the page.
It is necessary to recheck your robots.txt rules to make sure that all of your pages are crawled properly. So, you might need a tool such as Google Search Console or a site crawler to review it. I’ll dwell on it in a moment.
And so far, if you want to learn more about the purpose of the file, read this guide to hiding web pages from indexing.
A sitemap is another technical file that webmasters use for proper site indexing. This document, often in XML format, lists all the URLs on a website that should be indexed. A sitemap is a valuable source of information about a website’s structure and content.
Large websites may have several sitemaps: as the file is limited by size to 50,000 URLs and 50 MB, it can be split into several ones and include a separate sitemap for directories, images, videos, etc. E-commerce platforms like Shopify or Wix generate sitemaps automatically. For others, there are plugins or sitemap generator tools to create the files.
Among all, a website's sitemap lets you easily find all pages on it and ensure that they are indexed:
You should also recheck the correctness of your sitemap once in a while, as it may have issues too: it might be blank, responding with a 404 code, cached long ago, or it may simply contain the wrong URLs that you don’t want to appear in the index.
A good method to validate your sitemap is to use a website crawling tool. There are several website crawler tools available online, and one of them is WebSite Auditor which is a powerful SEO tool for sitewide audits. Let’s see how it can help you find all the pages on a website and validate technical files.
Here is how you can use WebSite Auditor to find all the pages on your website:
You can specify the instructions for a certain search bot or user agent; tell the crawler to ignore URL parameters, crawl a password-protected site, crawl a domain alone or together with subdomains, etc.
After you click OK, the tool will audit your site and collect all the pages in the Site Structure > Pages section.
WebSite Auditor will help you recheck if the URLs are properly optimized for search engines. You will get to know the tool in a few minutes, as the setup is quick, and the interface is pretty intuitive.
Let’s see what you can get from the website crawling tool.
In the All pages tab, you can sort the list by URL, title, or any other column by clicking on the column header.
You can use the search box to filter the list of pages by keyword or page URL. This can be helpful if you're looking for a specific page or group of pages.
Besides, you can add visible columns to present more information about this page, such as meta tags, headings, keywords, redirects, or any other on-page SEO element.
Finally, you can click on any URL to examine all resources on the page in the lower half of the workspace.
All the data can be handled inside the tool or copy/exported in CSV or Excel format.
The Site Audit section will show you lists of pages split by types of errors, such as:
Under each type of issue, you will see an explanation of why this factor is important and a few suggestions on how to fix it.
Besides, you can examine your visual sitemap in the Site Structure > Visualization which shows relations between all your URLs. The interactive map allows you to add or remove pages and links to adjust your site structure. You can recalculate the Internal PageRank value and check the Pageviews (as tracked by your Google Analytics).
On top of that, WebSite Auditor also checks the availability of both your robots.txt file and the sitemap.
It lets you edit the technical files in the Website tools and upload them straight to your site with the proper settings.
You will not need to observe any special syntax when editing the files – just select the required URLs and apply the necessary rules. Then, click to generate the files and save them to your computer or upload to the site via FTP.
One more great tool to discover all your site’s pages is Google Search Console. It will help you check the pages’ indexing and reveal the issues that hamper search bots from correctly indexing these URLs.
You can get a breakdown of all your pages by their indexing status, including those pages that have not been indexed yet.
Here is how to find all your site’s pages with Search Console:
1. Go to the Indexing report and click View data about indexed pages. You will see all the pages that the search bot last crawled on your website. However, mind that there will be a limit in the table of up to 1,000 URLs. There is a quick filter to sort all known pages from all submitted URLs, etc.
2. Enable the Not indexed tab. Below, the tool gives you the details on why each URL is not indexed.
Click on each reason and see the URLs affected by the issue.
The difficulty is that you will get not only the main URLs of your pages, but also anchor links, pagination pages, URL parameters, and other garbage that requires manual sorting. And the list might be incomplete because of the 1,000 entries limit in the table.
Among other things, mind that different search engines may have other indexing rules, and you need to use their webmaster tools to find and handle such issues. For example, use Bing Webmaster tools, Yandex Webmaster, Naver Webmaster, and others to check indexing in the respective search engines.
I guess Google Analytics is one of the most widely used analytics platforms, so any website owner or editor is familiar with it. The good old Universal Analytics is going to be soon replaced by Google Analytics 4. So, let’s see both versions of the tool.
To collect your site’s pages in Google's Universal Analytics, follow these steps:
You will see the pages with their user behavior stats, such as pageviews, bounce rate, average time on page, etc. Pay attention to pages with the least number of pageviews over all time – probably, they are orphan pages.
To recreate a similar flow in Google Analytics 4:
Just like with the Console, it will include URL parameters and the like. You can export the list of pages as a CSV or an Excel sheet by clicking on the Export button at the top of the page.
Some websites are really huge, and even powerful SEO spiders may have a hard time crawling all of their pages. Log analysis is a good option for finding and examining all pages on large websites.
By analyzing your website's log file, you can identify all the pages that get visitors from the web, their HTTP responses, how frequently crawlers visit the pages, and so forth.
Log files rest on your server, and you’ll need the required level of access to retrieve it and a log analyzer tool. So, this method is more suitable for tech-savvy people, webmasters or developers.
Here are the steps to find all your site's pages using log analysis:
Another way to find all pages on a website is to refer to your Content Management System (CMS), as it will contain all the URLs on the website you have once created. An example of CMSs are Wordpress or Squarespace which contain website building tools for content editing in different domains – news and blogging, e-commerce, corporate sites, and the like.
Although CMSs are quite different by appearance, the general steps apply to most of them:
Mind that there can be categories, blog posts, or landing pages, which are different types of pages that may belong to different sections in the CMS.
Most CMSs allow sorting the URLs by the date of their creation, author, category, or some other criteria. You can also use the search box to filter the list of pages by keywords or titles.
To find all pages on a website, there is a great array of methods and tools. The one you choose depends on the purpose and the scope of work to do.
I hope you’ve found this list helpful and will now be able to easily collect all your site’s pages even if you’re new to SEO.
If you have a question not answered yet, feel free to ask in our user group on Facebook.