
HOW THE SEARCH ENGINE WORKS: CRAWLING, INDEXING, AND RANKING IN SEARCH ENGINES

Search Engine Optimization

December 31, 2021

First and foremost, show up.

Search engines, as we discussed, are answer machines. They exist to find, understand, and organize content on the internet in order to provide the most appropriate answers to searchers' requests.

Your content must first be visible to search engines in order to show up in search results. It's arguably the most important piece of the SEO puzzle: if your site can't be found, there's no way you'll ever show up in the SERPs (Search Engine Results Pages).

How do search engines work?
Search engines have three primary functions:

Crawling: Scouring the internet for content and examining the code/content of each URL they find.
Indexing: Storing and organizing the content found during crawling. Once a page is in the index, it is eligible to be displayed as a result for relevant queries.
Ranking: Providing the pieces of content that will best answer a searcher's query, ordered from most relevant to least relevant.
What is crawling by a search engine?
Crawling is the process through which search engines dispatch a team of robots (also known as crawlers or spiders) to search for new and updated content. Content can take many forms — a webpage, an image, a video, a PDF, and so on — but regardless of the format, links are used to find it.

Googlebot starts out by fetching a few web pages, and then follows the links on those pages to find new URLs. By hopping along this path of links, the crawler is able to find new content and add it to its index, called Caffeine, a massive database of discovered URLs, to later be retrieved when a searcher is seeking information that the content on that URL is a good match for.

What is a search engine index?
Search engines process and store the information they find in an index, a massive database of all the content they've discovered and deemed good enough to serve up to searchers.

Ranking in search engines
When someone performs a search, search engines scour their index for highly relevant content and then order that content in the hopes of answering the searcher's query. This ordering of search results by relevance is known as ranking.

In general, the higher a website's ranking, the more relevant the search engine considers the site to be to the query.

It's possible to block search engine crawlers from part or all of your site, or to instruct search engines to avoid storing certain pages in their index. While there can be valid reasons for doing this, if you want your content found by searchers, you have to first make sure it's crawlable and indexable. Otherwise, it's as good as invisible.

By the end of this chapter, you'll have the context you need to work with the search engine, rather than against it!

 

Not all search engines are created equal when it comes to SEO.
Many newcomers wonder about the relative importance of particular search engines. Most people know that Google has the largest market share, but how important is it to optimize for Bing, Yahoo, and other engines? The truth is that despite the existence of more than 30 major web search engines, the SEO community really only pays attention to Google. Why? The short answer is that Google is where the vast majority of people search the web. More than 90% of web searches happen on Google, including Google Images, Google Maps, and YouTube (a Google property), which is roughly 20 times more than Bing and Yahoo combined.

Can search engines find your pages by crawling?
As you've just learned, getting your site crawled and indexed is a prerequisite to showing up in the SERPs. If you already have a website, it's a good idea to start by checking how many of your pages are in the index. This will yield some great insights into whether Google is crawling and finding all the pages you want it to, and none that you don't.

One way to check your indexed pages is "site:yourdomain.com", an advanced search operator. Head to Google and type "site:yourdomain.com" into the search bar. This will return the results Google has in its index for the specified site.

For more accurate results, monitor and use the Index Coverage report in Google Search Console. You can sign up for a free Google Search Console account if you don't currently have one. With this tool, you can submit sitemaps for your site and monitor how many of the submitted pages have actually been added to Google's index, among other things.
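For context, a sitemap is just an XML file that lists the URLs you want search engines to know about. Here is a minimal sketch, with a placeholder domain and dates:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page you want indexed -->
  <url>
    <loc>https://yourdomain.com/</loc>
    <lastmod>2021-12-31</lastmod>
  </url>
  <url>
    <loc>https://yourdomain.com/products/</loc>
    <lastmod>2021-12-15</lastmod>
  </url>
</urlset>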

There are a number of possible reasons why you aren't showing up in the search results:

Your site is new and has not yet been crawled.
There are no links to your site from other websites.
The navigation on your site makes it difficult for a robot to adequately crawl it.
Your site contains some basic code called crawler directives that is blocking search engines from crawling it.
Your site has been penalized by Google for spammy tactics.
Tell search engines how to crawl your site.
If you used Google Search Console or the "site:domain.com" advanced search operator and found that some of your important pages are missing from the index and/or some of your unimportant pages have been mistakenly indexed, there are some optimizations you can implement to better direct Googlebot in how you want your web content crawled. Telling search engines how to crawl your site can give you better control over what ends up in the index.

Most people think about ensuring that Googlebot can find their key pages, but it's easy to overlook the fact that there are undoubtedly pages you don't want Googlebot to find. Old URLs with little content, duplicate URLs (such as e-commerce sort-and-filter parameters), special promo code pages, staging or test pages, and so on are examples.

Use robots.txt to keep Googlebot away from particular pages and areas of your site.

Robots.txt
Robots.txt files live in the root directory of websites (for example, yourdomain.com/robots.txt) and use specific robots.txt directives to suggest which parts of your site search engines should and shouldn't crawl, as well as the speed at which they crawl your site.
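As an illustration, here is a minimal sketch of a robots.txt file; the paths are hypothetical placeholders, not recommendations for any particular site:

User-agent: *
# Keep all compliant crawlers out of these (hypothetical) sections
Disallow: /staging/
Disallow: /promo-codes/

User-agent: Googlebot
# A rule aimed only at Google's crawler
Disallow: /old-archive/

# Point crawlers to the sitemap
Sitemap: https://yourdomain.com/sitemap.xml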

Googlebot's approach to robots.txt files
If Googlebot can't find a robots.txt file for a site, it proceeds to crawl the site.
If Googlebot finds a robots.txt file for a site, it will usually abide by the suggestions and then proceed to crawl the site.
If Googlebot encounters an error while trying to access a site's robots.txt file and can't determine whether one exists or not, it won't crawl the site.
Make crawl budget optimization a priority!
Crawl budget optimization ensures that Googlebot isn't wasting time crawling through your unimportant pages at the risk of ignoring your important ones. Crawl budget matters most on very large sites with tens of thousands of URLs, but it's never a bad idea to block crawlers from accessing content you definitely don't care about. Just make sure not to block a crawler's access to pages on which you've added other directives, such as canonical or noindex tags. If Googlebot is blocked from a page, it won't be able to see the instructions on that page.

Not all web robots follow robots.txt. People with bad intentions (e.g., e-mail address scrapers) build bots that don't follow this protocol. In fact, some bad actors use robots.txt files to find where you've located your private content. Although it might seem logical to block crawlers from private pages such as login and administration pages so that they don't show up in the index, placing the location of those URLs in a publicly accessible robots.txt file also means that people with malicious intent can more easily find them. It's better to noindex these pages and gate them behind a login form rather than place them in your robots.txt file.
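As a sketch of what that noindex looks like in practice, it's a meta tag placed in the page's head (an equivalent X-Robots-Tag HTTP header also works):

<!-- Tells compliant crawlers not to include this page in their index -->
<meta name="robots" content="noindex">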

More information on this can be found in the robots.txt section of our Learning Center.

Using GSC to define URL parameters
Some sites (most common in e-commerce) make the same content available at multiple different URLs by appending certain parameters to those URLs.

If you've ever done any online shopping, you've probably used filters to narrow down your results. For example, on Amazon, you may search for "shoes" and then narrow your results by size, color, and style. The URL varies somewhat each time you refine it:

https://www.example.com/products/women/dresses/green.htm

https://www.example.com/products/women?category=dresses&color=green

https://example.com/shopindex.php?product_id=32&highlight=green+dress&cat_id=1&sessionid=123&affid=43
How does Google know which version of the URL to serve to searchers? Although Google does a pretty good job of figuring out the representative URL on its own, you can use the URL Parameters tool in Google Search Console to tell Google exactly how you want your pages to be treated. If you use this feature to tell Googlebot to "crawl no URLs with parameter," then you're essentially asking to hide this content from Googlebot, which could result in the removal of those pages from search results. That's what you want if those parameters create duplicate pages, but it's not ideal if you want those pages to be indexed.
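A related pattern, offered here only as a hedged sketch and not as part of the GSC tool, is to mark the preferred version of such pages with a canonical tag, so that the parameterized duplicates consolidate to one URL:

<!-- Placed in the <head> of each filtered or parameterized variant -->
<link rel="canonical" href="https://www.example.com/products/women/dresses/green.htm" />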

Is it possible for crawlers to find all of your important content?
Now that you know some tactics for keeping search engine crawlers away from your unimportant content, let's look at the optimizations that can help Googlebot find your important pages.

When a search engine crawls your site, it may be able to identify some pages or portions, while other pages or sections may be hidden for various reasons. It's critical to ensure that search engines can find all of the information you want to be indexed, not just your homepage.

Consider this: Is it possible for the bot to crawl through your website rather than simply to it?


4xx Codes: When a client error prevents search engine crawlers from accessing your content.
4xx errors are client errors, indicating that the requested URL contains bad syntax or cannot be fulfilled. One of the most common 4xx errors is the "404 – not found" error. These can occur because of a URL typo, a deleted page, or a broken redirect, to name just a few examples. When search engines hit a 404, they can't access the URL. When users hit a 404, they can get frustrated and leave.

5xx Codes: When a server fault prevents search engine crawlers from accessing your material.
5xx errors are server errors, meaning the server the web page is located on failed to fulfill the searcher's or search engine's request to access the page. Google Search Console's "Crawl Errors" report has a tab dedicated to these errors. These typically happen because the request for the URL timed out, so Googlebot abandoned the request. See Google's documentation to learn more about fixing server connectivity issues.
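A quick way to see exactly which status code a URL returns, assuming you have curl available on the command line, is a HEAD request:

curl -I https://yourdomain.com/some-page
# The first line of the response shows the status,
# e.g. "HTTP/1.1 404 Not Found" or "HTTP/1.1 301 Moved Permanently"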

Thankfully, there's a way to tell both searchers and search engines that your page has moved: the 301 (permanent) redirect.

Make your own 404 pages!
Add links to important pages on your site, a site search feature, and even contact information to your 404 page. This should make visitors less likely to bounce off your site when they hit a 404.
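As a purely illustrative sketch (the markup and link targets are placeholders, not a prescribed template), a custom 404 page might look like this:

<!-- A custom 404 page that gives visitors somewhere to go next -->
<h1>Page not found</h1>
<p>The page you're looking for may have moved. Try one of these instead:</p>
<ul>
  <li><a href="/">Home</a></li>
  <li><a href="/products/">Products</a></li>
  <li><a href="/contact/">Contact us</a></li>
</ul>
<form action="/search" method="get">
  <input type="search" name="q" placeholder="Search this site">
  <button type="submit">Search</button>
</form>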

Learn more about custom 404 pages.
Let's say you move a page from example.com/young-dogs/ to example.com/puppies/. Search engines and users need a bridge to cross from the old URL to the new. That bridge is a 301 redirect.
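As a sketch, assuming an Apache server with mod_alias enabled (the directive syntax differs on other servers), the redirect could be declared in the site's .htaccess file:

# Permanently redirect the old URL to its new home
Redirect 301 /young-dogs/ https://example.com/puppies/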

When you implement a 301:

Link equity: Transfers link equity from the page's old location to the new URL.
Indexing: Helps Google find and index the new version of the page.
User experience: Ensures users find the page they're looking for.

When you don't implement a 301:

Link equity: Without a 301, the authority from the previous URL is not passed on to the new version of the URL.
Indexing: The presence of 404 errors on your site alone doesn't harm search performance, but letting ranking and trafficked pages 404 can result in them falling out of the index, with their rankings and traffic going with them - ouch!
User experience: Allowing your visitors to click on dead links takes them to error pages rather than the intended page, which can be frustrating.

The 301 status code itself means that the page has permanently moved to a new location, so avoid redirecting URLs to irrelevant pages, that is, URLs where the old URL's content doesn't actually live. If a page is ranking for a query and you 301 it to a URL with different content, it might drop in rank position because the content that made it relevant to that particular query isn't there anymore. 301 redirects are powerful, so use them wisely!

 

You also have the option of 302 redirecting a page, but this should be reserved for temporary moves and cases where passing link equity isn't as big of a concern. A 302 is like a highway detour: you're temporarily siphoning traffic through a certain route, but it won't stay that way forever.
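Under the same Apache assumption as the 301 sketch above (and with hypothetical paths), a temporary redirect looks nearly identical:

# Temporarily redirect; search engines keep the original URL indexed
Redirect 302 /sale/ https://example.com/holiday-sale/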

 

Be wary of redirect chains!