What is a search engine and how does it work? How Internet search engines work

Many users turn to the Internet to get answers to the queries (questions) they enter.

If there were no search engines, users would have to search for the necessary sites on their own, remember them, and write them down. In many cases, finding something suitable "manually" would be very difficult, and often simply impossible.

Search engines do all this routine work of finding, storing, and sorting information from websites for us.

Let's start with the well-known Runet search engines.

Internet search engines in Russian

1) Let's start with the domestic search engine Yandex. It works not only in Russia, but also in Belarus, Kazakhstan, Ukraine, and Turkey. There is also an English-language version of Yandex.

2) The Google search engine came to us from America and has a Russian-language localization.

3) The domestic search engine Mail.ru, which also represents the social networks VKontakte and Odnoklassniki, as well as My World, the famous Answers Mail.ru, and other projects.

4) The intelligent search engine Nigma (http://www.nigma.ru/).

Since September 19, 2017, the "intelligent" Nigma has not been operating. It ceased to be of financial interest to its creators, who switched to another search engine called CocCoc.

5) The well-known company Rostelecom created the Sputnik search engine.

There is also a version of Sputnik designed specifically for children, which I have written about before.

6) Rambler was one of the first domestic search engines.

There are other famous search engines in the world:

  • Bing,
  • Yahoo!,
  • Baidu,
  • Ecosia.

Let's try to figure out how a search engine works: how sites are indexed, how the indexing results are analyzed, and how search results are generated. The principles of operation of search engines are roughly the same: finding information on the Internet, storing it, and sorting it so it can be served in response to user queries. But the algorithms the search engines use can differ greatly. These algorithms are kept secret, and their disclosure is prohibited.

By entering the same query in the search boxes of different search engines, you can get different answers. The reason is that all search engines use their own algorithms.

Purpose of search engines

First of all, you need to know that search engines are commercial organizations. Their goal is to make a profit. Profit can come from contextual advertising, other types of advertising, and from promoting particular sites to the top of the search results. In general, there are many ways.

Advertising revenue depends on the size of the audience, that is, on how many people use the search engine. The larger the audience, the more people will see the ads, and accordingly, the more that advertising will cost. Search engines can grow their audience through advertising of their own, as well as by attracting users through improvements to the quality of their services, algorithms, and search convenience.

The most important and difficult thing here is the development of a full-fledged functioning search algorithm that would provide relevant results for most user queries.

The work of the search engine and the actions of webmasters

Each search engine has its own algorithm, which must take into account a huge number of different factors when analyzing information and compiling results in response to a user request:

  • the age of a particular site,
  • site domain characteristics,
  • the quality of the content on the site and its types,
  • site navigation and structure features,
  • usability (user-friendliness),
  • behavioral factors (the search engine can determine whether the user found what they were looking for on the site, or returned to the search engine and ran the same query again),
  • etc.

All this is needed precisely to ensure that the results returned for a user's query are as relevant as possible and satisfy the user's needs. At the same time, search engine algorithms are constantly changing and improving. As they say, there is no limit to perfection.

On the other hand, webmasters and SEOs are constantly inventing new ways to promote their sites, which are not always fair. The task of the developers of the search engine algorithm is to make changes to it that would not allow “bad” sites of dishonest optimizers to appear in the TOP.

How does a search engine work?

Now let's look at how a search engine actually does its work. It consists of at least three stages:

  • crawling (scanning),
  • indexing,
  • ranking.

The number of sites on the Internet is simply astronomical. And each site is information, informational content that is created for readers (real people).

Crawling (scanning)

This is the search engine roaming the Internet to collect new information, analyze links, and find new content that can later be served to users in response to their queries. For crawling, search engines have special robots, called search robots or spiders.

Search robots are programs that automatically visit websites and collect information from them. Crawling can be primary (the robot visits a new site for the first time). After the initial collection of information from the site and its entry into the search engine's database, the robot begins to visit the site's pages with a certain regularity. If there have been any changes (new content added, old content removed), all these changes are recorded by the search engine.

The main task of the search spider is to find new information and give it to the search engine for the next stage of processing, that is, for indexing.
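To make this more concrete, here is a minimal, hypothetical sketch of what a crawler does: it downloads a page, extracts the visible text and the links, and queues newly discovered links for later visits. The starting URL, the class and function names, and the use of Python's standard library are illustrative assumptions, not the code of any real search engine.

```python
# A toy crawler: fetch pages, remember their text, follow links (sketch only).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkAndTextParser(HTMLParser):
    """Collects href attributes and visible text from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links, self.text_parts = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)

def crawl(start_url, max_pages=10):
    """Breadth-first crawl from start_url; returns {url: page text}."""
    queue, seen, collected = deque([start_url]), {start_url}, {}
    while queue and len(collected) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except (OSError, ValueError):
            continue  # unreachable or unsupported address; a real robot would retry later
        parser = LinkAndTextParser()
        parser.feed(html)
        collected[url] = " ".join(parser.text_parts)
        for link in parser.links:
            absolute = urljoin(url, link)   # turn relative links into absolute ones
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)      # a newly discovered document to visit later
    return collected

# pages = crawl("https://example.com")  # hypothetical starting point
```

Everything the crawler collects in this way is then handed over for the next stage of processing, that is, for indexing.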

Indexing

The search engine can search for information only among those sites that are already included in its database (indexed by it). If scanning is the process of searching and collecting information that is available on a particular site, then indexing is the process of entering this information into the search engine database. At this stage, the search engine automatically decides whether to enter this or that information into its database and where to enter it, in which section of the database. For example, Google indexes almost all information found by its robots on the Internet, while Yandex is more picky and does not index everything.

For new sites, the indexing phase can take a long time, so a new site may have to wait a while for visitors from search engines. New information that appears on old, established sites, however, can be indexed almost instantly and gets into the "index", that is, into the search engine's database, almost immediately.

Ranking

Ranking is the ordering of information that was previously indexed and entered into the database of a particular search engine: it determines which information the search engine shows its users first and which is placed lower in the results. Ranking can be considered the stage at which the search engine serves its client, the user.

On the search engine's servers, the collected information is processed and results are generated for a huge range of all kinds of queries. This is where search engine algorithms come into play. All sites in the database are classified by topic, and topics are divided into groups of queries. For each group of queries, a preliminary set of results can be compiled, which is subsequently adjusted.

Why does a marketer need to know the basic principles of search engine optimization? It's simple: organic search is a great source of incoming target-audience traffic for your corporate website and even for landing pages.

This begins a series of educational posts on the topic of SEO.

What is a search engine?

A search engine is a large database of documents (content). Search robots crawl resources and index different types of content, and it is these saved documents that are ranked in search.

In effect, Yandex is a "snapshot" of the Runet (plus Turkey and a handful of English-language sites), while Google is a snapshot of the global Internet.

A search index is a data structure containing information about documents and the location of keywords in them.

In terms of how they operate, search engines are similar to one another; the differences lie in the ranking formulas (the ordering of sites in the search results), which are based on machine learning.

Every day, millions of users submit queries to search engines: "write an abstract", "buy", and so on.

How is a search engine organized?

To provide users with quick answers, the search architecture was divided into 2 parts:

  • basic search,
  • metasearch.

Basic search

Basic search - a program that searches its part of the index and provides all the documents that match the query.

Metasearch

Metasearch is a program that processes a search query and determines the user's region. If the query is popular, it returns a ready-made, cached set of results; if the query is new, it sends a command to basic search to select documents, then ranks the found documents using machine learning and returns them to the user.
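To illustrate this split, here is a tiny hypothetical sketch: "metasearch" first checks a cache of popular queries and otherwise asks several "basic search" shards, each of which scans only its own part of the index. The cache contents, the shard layout, and the merging step are invented for illustration; a real engine would also apply machine-learned ranking at this point.

```python
# Toy split into "metasearch" and "basic search" (illustrative assumptions only).

popular_cache = {"weather": ["weather.example/today"]}   # ready-made answers for frequent queries

# Each basic-search shard holds its own part of the index: {term: [urls]}.
shards = [
    {"search": ["site-a/how-search-works"], "engine": ["site-a/how-search-works"]},
    {"pancakes": ["site-b/recipes"]},
]

def basic_search(shard, terms):
    """Return every document in this shard that matches at least one query term."""
    found = []
    for term in terms:
        found.extend(shard.get(term, []))
    return found

def metasearch(query):
    """Serve a cached answer for popular queries, otherwise query all shards and merge."""
    if query in popular_cache:
        return popular_cache[query]
    terms = query.lower().split()
    results = []
    for shard in shards:
        results.extend(basic_search(shard, terms))
    # A real engine would rank the merged list here; we simply remove duplicates.
    return list(dict.fromkeys(results))

print(metasearch("weather"))          # answered from the cache
print(metasearch("search engine"))    # gathered from the shards
```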

Search query classification

To give a relevant answer to the user, the search engine first tries to understand what he specifically needs. The search query is analyzed and the user is analyzed in parallel.

Search queries are analyzed by parameters:

  • length;
  • definition;
  • popularity;
  • competitiveness;
  • syntax;
  • geography.

Request type:

  • navigation;
  • informational;
  • transactional;
  • multimedia;
  • general;
  • official.

After parsing and classifying the query, the ranking function is selected.

The classification of query types is confidential information, and the options listed here are educated guesses by search engine promotion specialists.

If the user enters a general query, the search engine returns different types of documents. It should be understood that when you promote a commercial page of a site into the TOP-10 for a general query, you are competing not for one of all ten places, but only for the number of places that the ranking formula allocates to commercial pages. The probability of reaching the top for such queries is therefore lower.

MatrixNet is a machine learning algorithm introduced by Yandex in 2009 that selects the ranking function for particular classes of queries.

MatrixNet is used not only in Yandex search, but also for scientific purposes. For example, at the European Center for Nuclear Research (CERN) it is used to search for rare events in large amounts of data (they are looking for the Higgs boson).

Primary data for evaluating the effectiveness of the ranking formula is collected by the assessors department. These are specially trained people who evaluate a sample of sites ranked by an experimental formula according to the following criteria.

Site quality assessment

Vital - the official site (Sberbank, LPgenerator). The search query corresponds to an official website, its groups on social networks, and information on authoritative resources.

Useful (score 5) - a site that provides comprehensive information for the query.

Example - request: banner fabric.

A site corresponding to the "useful" rating should contain information:

  • what is banner fabric;
  • specifications;
  • photos;
  • types;
  • price list;
  • something else.

Top request examples:

Relevant+ (Score 4) - This score means that the page matches the search query.

Relevant- (Score 3) - The page does not exactly match the search query.

Let's say the search for "Guardians of the Galaxy showtimes" displays a page about the movie without showtimes, a page for a past showing, or a trailer page on YouTube.

Irrelevant (Score 2) - The page does not match the query.
Example: for a query with the name of one hotel, the page of a different hotel is displayed.

To promote a resource for a general or informational request, you need to create a page corresponding to the “useful” rating.

For clear queries, it is enough to meet the "relevant+" score.

Relevance is achieved through textual and link matching of the page with search queries.

Conclusions

  1. Not every query is suitable for promoting a commercial landing page;
  2. Not every informational query can be used to promote a commercial site;
  3. When promoting for a general query, create a "useful" page.

A common reason why a site does not reach the top is that the content of the promoted page does not match the search query.

We will talk about this in the next article “Checklist for basic website optimization”.

A search engine is a database of specific information from the Internet. Many users believe that as soon as they enter a query into a search engine, the entire Internet is crawled at that very moment, but this is not the case at all. The Internet is scanned constantly by many programs, and data about sites is entered into a database where, according to certain criteria, all sites and all their pages are distributed into various lists and databases. In other words, it is a kind of data file, and the search takes place not across the Internet itself, but within this file.

Google is the most popular search engine in the world.

In addition to the search engine, Google offers many additional services, software, and hardware, including a mail service, the Google Chrome browser, the largest video library YouTube, and many other projects. Google steadily buys up projects that bring in large profits. Most of its services are aimed not at the individual user but at making money on the Internet, and they are integrated with a focus on the interests of European and American users.

Mail.ru is a search engine popular mainly because of its mail service.

There are many additional services, the key one being Mail.ru mail. At the moment, Mail.ru owns the Odnoklassniki social network, its own My World network, the Money-Mail service, many online games, and three almost identical browsers under different names. All of its applications and services carry a lot of advertising. The social network VKontakte blocks direct transitions to Mail.ru services, citing a large number of viruses.

Wikipedia.

Wikipedia is a searchable reference system.

A non-profit search system that exists on private donations and therefore does not fill its pages with advertising. It is a multilingual project whose goal is to create a complete reference encyclopedia in all the languages of the world. It has no specific authors; it is filled in and managed by volunteers from all over the world. Any user can both write and edit an article.

The official page is www.wikipedia.org.

Youtube is the largest video library.

Video hosting with elements of a social network, where any user can add a video. Since its acquisition by Google Inc., separate registration for YouTube is not required; it is enough to be registered with the Google mail service.

The official page is youtube.com.

Yahoo! is the second most important search engine in the world.

There are additional services, the most famous of which is Yahoo Mail. As part of improving the quality of its search, Yahoo transmits data about users and their queries to Microsoft. From this data, a picture of users' interests is formed, as well as a market for advertising content. The Yahoo search engine also acquires other companies; for example, Yahoo owns the Altavista search service and a stake in the Alibaba e-commerce site.

The official page is www.yahoo.com.

WDL is a digital library.

The library collects books of cultural value in digital form. The main goal is to increase the level of cultural content of the Internet. Access to the library is free.

The official page is www.wdl.org/ru/.

Bing is a search engine from Microsoft.

The official page is www.bing.com.

Search engines in Russia

Rambler is a "pro-American" search engine.

It was originally created as a media Internet portal. Like many other search engines, it has services for image search, video, maps, weather forecasts, a news section, and much more. The developers also offer a free browser, Rambler-Nichrome.

The official page is www.rambler.ru.

Nigma is an intelligent search engine.

A more convenient search engine thanks to its many filters and settings. The interface allows you to include or exclude suggested similar terms in the search to get better results. When showing search results, it can also draw on information from other major search engines.

The official page is www.nigma.ru.

Aport - online catalog of goods.

In the past it was a search engine, but after development and innovation were discontinued, it quickly lost ground. At the moment, Aport is a trading platform where goods from more than 1,500 companies are presented.

The official page is www.aport.ru.

Sputnik is a national search engine and Internet portal.

Created by Rostelecom. It is currently in the testing phase.

The official website is www.sputnik.ru.

Metabot is a developing search engine.

Metabot's task is to create a search engine over all other search engines, building its result rankings from the data of the entire list of search engines. In other words, it is a search engine for search engines.

The official page is www.metabot.ru.

Turtle is a search engine whose operation has been suspended.

The official page is www.turtle.ru.

KM - multiportal.

Initially, the site was a multiportal, with a search engine introduced later. Searches can be carried out both within the site itself and across all tracked Runet sites.

The official page is www.km.ru.

Gogo no longer works; it redirects to another search engine.

The official page is www.gogo.ru.

Zoneru is a Russian multiportal that is not very popular and needs improvement. The search engine includes news, TV, games, and maps.

The official page is www.zoneru.org.

This search engine no longer works; its developers suggest using a different search engine instead.

21.11.2017

Whatever question worries the modern person, they no longer look for answers in books; they search for them on the Internet. Moreover, you do not need to know the address of the site where the information you need is located. There are millions of such sites, and a search engine helps you find the right one.

In the vastness of our domestic Internet, the two most popular search engines are Google and Yandex.

Have you ever wondered how a search engine works? How does it understand which site to show, and which of the millions of resources actually has the answer to your query?

What is a search engine?

A search engine is a huge database of web documents that is constantly updated and expanded. Each search engine has search spiders (robots): special bots that crawl sites, index the content posted on them, and then rank it according to its quality and its relevance to users' search queries.

Search engines work to ensure that anyone can find any information. Therefore, they try to show first of all those web documents that have the most detailed answer to a person’s question.

At its core, a search engine is a directory of sites whose main function is to search for information within that very directory.

As I wrote above, we have two popular systems: Google (worldwide) and Yandex (the Russian-speaking segment). But there are also systems such as Rambler, Yahoo, Bing, Mail.Ru, and others. The principle of operation is similar for all of them; only the ranking algorithms differ (and even those not very significantly).

How a search engine works on the Internet

The principle of operation of search engines is very complicated, but I will try to explain in simple terms.

The search robot (spider) crawls the pages of the site, downloads their content and extracts links. Next, the indexer begins its work - this is a program that analyzes all materials downloaded by spiders, based on its own work algorithms.

Thus, a search engine database is created in which all documents processed by the algorithm are stored.

The search query is processed as follows (a simplified sketch of this pipeline follows the list):

  • the query entered by the user is analyzed;
  • the analysis results are passed to a special ranking module;
  • the data of all documents are processed, and those most relevant to the entered query are selected;
  • a snippet is generated: a title and a description, with words from the query highlighted in bold;
  • the search results are presented to the user in the form of a SERP (search results page).
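Here is a minimal, hypothetical sketch of that pipeline over a tiny hand-made index: the query is analyzed, documents are scored, a snippet with bolded query words is cut out, and a SERP is returned. The function names, the word-counting score, and the sample documents are illustrative assumptions, not the actual modules of Yandex or Google.

```python
# A toy query pipeline: analyze -> rank -> build snippets -> return a SERP (sketch only).

def analyze_query(query):
    """Very rough "analysis": lowercase the query and split it into terms."""
    return query.lower().split()

def rank(terms, documents):
    """Score documents by how many query terms they contain (a stand-in for a real ranking module)."""
    scored = []
    for url, text in documents.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scored.append((score, url, text))
    return sorted(scored, reverse=True)

def make_snippet(text, terms, width=80):
    """Cut a fragment around the first matched term and bold the query words in it."""
    words = text.split()
    lowered = [w.lower() for w in words]
    start = next((i for i, w in enumerate(lowered) if w in terms), 0)
    fragment = words[max(0, start - 5):start + 10]
    return " ".join(f"<b>{w}</b>" if w.lower() in terms else w for w in fragment)[:width]

def serp(query, documents, top_n=10):
    """Assemble the search results page: a list of (url, snippet) pairs."""
    terms = analyze_query(query)
    return [(url, make_snippet(text, terms)) for _, url, text in rank(terms, documents)[:top_n]]

documents = {
    "site-a/page": "how a search engine ranks pages and builds snippets",
    "site-b/page": "recipes for pancakes",
}
print(serp("search engine", documents))
```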

Search engine principles

The main task of any search engine is to provide the user with the most useful and accurate information for a query. That is why the search robot crawls sites constantly. Soon after your site launches, according to a certain schedule, the spider comes to visit you and crawls a number of pages, after which they are indexed.

The principle of operation of search engines is based on two main stages:

  • crawling pages that collect data;
  • assignment of an index, thanks to which the system will be able to quickly search the contents of this page.

Once a website page is indexed, it will already appear in the search results for a specific search query. You can check if a new page has been indexed by a search engine using webmaster tools. For example, in Yandex.Webmaster you can immediately see which pages were indexed and when, and which ones fell out of the index and for what reason.

But what position it ends up in depends on the degree of indexing and the quality of its content. If your page gives the most accurate answer to a query, it will rank above all others.

Principles of ranking sites in search engines

We have figured out the principle by which search robots work. But how are sites ranked?

Ranking is based on two main "pillars" - the text content of the page and non-text factors.

Text content is the content of the page. The fuller, more accurate, and more relevant to the query it is, the higher the page will appear in the search results. In addition to the text itself, the search engine pays attention to the title (page title), description (page description), and H1 (text heading) tags.

Non-text factors are internal linking and external links. The idea is simple: if a site is interesting and useful, other thematic resources link to it. And the more such links there are, the more authoritative the resource.

But these are the most basic principles, very briefly. Let's delve a little deeper.

Main website ranking factors

There are a number of factors that affect the ranking of a site. The main ones are:

1. Internal site ranking factors

This is the text on the site and its presentation: subheadings and the highlighting of important points in the text. The use of internal linking also belongs here. Visual elements matter as well: the use of pictures, photos, videos, and graphs. The quality of the text itself and its content are also important.

2. External site ranking factors, which determine its popularity. These are the same external links that lead to your site from other resources. Not only the number of these links matters, but also their quality (it is desirable that the linking sites are thematically similar to yours), as well as the overall quality of the link profile (how quickly these links appeared, and whether they appeared naturally or through link-exchange purchases).

Based on the foregoing, one conclusion can be drawn: search engines try to work in such a way as to show the user those sites that give the most complete answer to his request and have already earned a certain authority. In this case, a variety of factors are taken into account: the content of the site, and its settings, and the attitude of users towards it. A site that is good in all respects will certainly take a high place in the SERP.

Hello, dear readers of this blog. If you are engaged in SEO or, in other words, search engine optimization, whether at a professional level (promoting commercial projects for money) or at an amateur level, you will inevitably find that you need to understand the general principles of how search engines work in order to successfully optimize your own site or someone else's for them.

The enemy, as they say, must be known by sight, although, of course, they (for the Runet, that means Yandex and Google) are not enemies for us at all, but rather partners, because their share of traffic is in most cases dominant and primary. There are, of course, exceptions, but they only confirm this rule.

What is a snippet and how search engines work

But first we need to figure out what a snippet is, what it is for, and why its content is so important to the optimizer. In the search results, it is located immediately below the link to the found document (the text of which is taken from the title you have already written):

Pieces of text from the document itself are usually used as the snippet. The ideal option is a snippet that allows the user to form an opinion about the content of the page without visiting it (but that is only if the snippet turned out well, which is not always the case).

The snippet is generated automatically, and the search engine itself decides which fragments of text to use in it. Importantly, for different queries the same web page will have different snippets.

However, the contents of the Description meta tag can sometimes be used (especially in Google) as the snippet. Of course, this also depends on the search engine in whose results the page is shown.

The contents of the Description tag may be displayed, for example, if the words of the query match the words you used in the description, or if the algorithm has not yet found suitable text fragments on your page for some of the queries for which it appears in the Yandex or Google results.

Therefore, don't be lazy: fill in the contents of the Description tag for every article. In WordPress, this can be done with the plugin I described earlier (and I highly recommend that you use it).

If you are a fan of Joomla, then you can use the corresponding material.

But the snippet cannot be obtained from the reverse index, because the reverse index stores information only about the words used on the page and their positions in the text. That is why, to create snippets of the same document for different search results (for different queries), our favorite Yandex and Google, in addition to the reverse index (needed directly for searching, more on it below), also save the direct index, that is, a copy of the web page.

By storing a copy of the document in their database, they can then conveniently cut the necessary snippets from it without going back to the original.

Thus, it turns out that search engines store both the direct and the reverse index of a web page in their database. By the way, you can indirectly influence the formation of snippets by optimizing the text of the web page so that the algorithm chooses exactly the fragment of text you have in mind. But we will talk about that in another article in this series.

How search engines work in general terms

The essence of optimization is to "help" search engine algorithms raise the pages of the sites you are promoting to the highest possible positions in the search results for certain queries.

I put the word "help" in the previous sentence in quotation marks because, with our optimization actions, we do not exactly help; often we even prevent the algorithm from producing results that are fully relevant to the query.

But this is the optimizers' bread and butter, and until search algorithms become perfect, there will be opportunities to improve positions in the Yandex and Google search results through internal and external optimization.

But before moving on to studying optimization methods, we need to understand, at least superficially, the principles by which search engines work, so that all further actions are taken consciously, with an understanding of why they are necessary and how those we are trying to outwit will react to them.

It is clear that we will not be able to understand their entire logic from start to finish, because much of this information is not disclosed, but at first it will be enough for us to understand the fundamental principles. So let's get started.

So how do search engines actually work? Oddly enough, the logic is essentially the same for all of them: information is collected about all the web pages on the network that they can reach, after which this data is cleverly processed to make it convenient to search. That, in fact, is all, and this article could be considered complete, but let's still add some specifics.

First, let's clarify that a document is what we usually call a site page. It must have its own unique address (URL), and, notably, hash links do not create a new document.

Secondly, it is worth dwelling on the algorithms (methods) for searching for information in the collected document database.

Direct and Inverse Index Algorithms

Obviously, simply iterating through all the pages stored in the database would not be optimal. This method is called direct search, and while it allows you to reliably find the information you need without missing anything important, it is completely unsuitable for working with large amounts of data, because the search would take far too long.

Therefore, to work effectively with large amounts of data, the algorithm of inverse (inverted) indexes was developed. And, remarkably, it is this algorithm that is used by all the major search engines in the world. So we will dwell on it in more detail and look at the principles of its operation.

When the inverted index algorithm is used, documents are converted into text files containing a list of all the words used in them.

The words in such lists (index files) are arranged in alphabetical order, and next to each word the places on the web page where it occurs are recorded as coordinates. In addition to the position in the document, other parameters are stored for each word that determine its significance.

If you remember, in many books (mostly technical or scientific ones) the last pages carry a list of the words used in the book, with the numbers of the pages where they occur. Of course, this list does not include every word used in the book, but it can nevertheless serve as an example of building an index file using inverted indexes.
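As a rough sketch of what such an index file might contain, here is a toy inverted index built in a few lines. The sample documents and the dictionary layout are illustrative assumptions; real index files also store extra per-word parameters and are compressed, as noted below.

```python
# Building a toy inverted index: word -> {document id: [positions]} (sketch only).
from collections import defaultdict

documents = {
    1: "search engines store an inverted index",
    2: "the inverted index maps words to documents",
}

inverted_index = defaultdict(dict)
for doc_id, text in documents.items():
    for position, word in enumerate(text.lower().split()):
        inverted_index[word].setdefault(doc_id, []).append(position)

# Looking up a word immediately gives the documents that contain it and where:
print(dict(inverted_index["inverted"]))   # {1: [4], 2: [1]}
print(dict(inverted_index["engines"]))    # {1: [1]}
```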

Note that search engines look for information not on the Internet itself, but in the reverse indexes of the web pages they have processed. The direct indexes (the original text) are preserved as well, because they are later needed for composing snippets, but we already talked about this at the beginning of this post.

The inverted index algorithm is used by all systems because it speeds up the search, but some loss of information is inevitable due to the distortions introduced by converting a document into an index file. For storage convenience, inverted index files are usually compressed in clever ways.

Mathematical model used for ranking

To search by reverse indexes, a mathematical model is used that simplifies the process of finding the necessary web pages (for the query entered by the user) and of determining the relevance of all the found documents to that query. The more a document matches a given query (the more relevant it is), the higher it should appear in the search results.

This means that the main task performed by the mathematical model is to search for pages in its database of reverse indexes corresponding to a given query and then sort them in descending order of relevance to a given query.

A simple Boolean model, in which a document is returned whenever the search phrase is found in it, does not suit us, because of the huge number of such web pages that would be handed to the user for consideration.

The search engine must not only provide a list of all web pages that contain the words from the query. It should provide this list in such a way that the documents most relevant to the user's request will be at the very beginning (sort by relevance). This task is not trivial and cannot be done perfectly by default.

By the way, the imperfection of any mathematical model is exploited by optimizers, who influence the ranking of documents in the search results in one way or another (in favor of the site they are promoting, of course). The mathematical model used by all search engines belongs to the vector class. It uses a concept called the weight of a document with respect to the query specified by the user.

In the basic vector model, the weight of a document for a given query is calculated based on two main parameters: the frequency with which the given word occurs in it (TF, term frequency) and how rarely this word occurs in all the other pages of the collection (IDF, inverse document frequency).

The collection refers to the entire set of pages known to the search engine. Multiplying these two parameters with each other, we get the weight of the document for a given query.

Naturally, various search engines, in addition to the TF and IDF parameters, use many different coefficients to calculate the weight, but the essence remains the same: the weight of the page will be the greater, the more often the word from the search query occurs in it (up to certain limits, after which the document can be recognized as spam) and the less common this word is in all other documents indexed by this system.
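As an illustration of this basic vector model, here is a minimal sketch of a TF-IDF weight calculation. The exact formula (the logarithm in IDF, the normalization by document length, the tiny sample collection) is an assumption borrowed from the textbook definition of TF-IDF, not the formula of any particular search engine.

```python
# Toy TF-IDF: the weight of a document for a query grows with the term frequency (TF)
# and with how rare the term is across the whole collection (IDF). Sketch only.
import math

collection = {
    "page-1": "banner fabric price and banner fabric types",
    "page-2": "how to sew fabric",
    "page-3": "holiday banner photos",
}

def weight(document_text, query):
    """Sum of TF * IDF over the query terms (assumes every term occurs somewhere in the collection)."""
    words = document_text.lower().split()
    total = 0.0
    for term in query.lower().split():
        tf = words.count(term) / len(words)                                   # frequency in this document
        containing = sum(term in text.lower().split() for text in collection.values())
        idf = math.log(len(collection) / containing)                          # rarer terms weigh more
        total += tf * idf
    return total

for url, text in collection.items():
    print(url, round(weight(text, "banner fabric"), 3))   # page-1 should come out on top
```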

Evaluation of the quality of the work of the formula by assessors

Thus, it turns out that the results for queries are formed entirely by the formula, without human intervention. But no formula will work perfectly, especially at first, so the operation of the mathematical model needs to be monitored.

For these purposes, specially trained people, assessors, are used: they review the results (of the specific search engine that hired them) for various queries and evaluate the quality of the current formula.

All the comments they make are taken into account by the people responsible for tuning the mathematical model. Changes or additions are made to its formula, as a result of which the quality of the search engine's work increases. It turns out that assessors act as a kind of feedback loop between the developers of the algorithm and its users, which is necessary for improving quality.

The main criteria in assessing the quality of the work of the formula are:

  1. Search accuracy (precision) is the share of relevant documents (those that correspond to the query) in the results. The fewer irrelevant web pages (for example, doorways) there are, the better.
  2. Completeness of the results (recall) is the ratio of the relevant web pages shown to the total number of relevant documents available in the entire collection. That is, the whole database known to the search engine may contain more pages matching a given query than are shown in the results; in that case we speak of incomplete results. It is possible that some relevant pages fell under a filter and were, for example, mistaken for doorways or other junk. (The standard formulas for these two measures are given right after this list.)
  3. Freshness of the results is the degree to which the real web page on the site matches what is written about it in the results. For example, a document may no longer exist or may be heavily modified, yet still be present in the results for a given query despite its physical absence at the specified address or its current mismatch with the query. Freshness depends on how often search robots re-crawl the documents in their collection.
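The first two criteria correspond to the standard information-retrieval notions of precision and recall. Written as formulas (these are the generic textbook definitions, not anything specific to Yandex or Google):

```latex
\text{precision} = \frac{|\,\text{relevant} \cap \text{retrieved}\,|}{|\,\text{retrieved}\,|},
\qquad
\text{recall} = \frac{|\,\text{relevant} \cap \text{retrieved}\,|}{|\,\text{relevant}\,|}
```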

How Yandex and Google collect their collection

Despite the seeming simplicity of indexing web pages, there are many nuances that you need to know about and later use when optimizing (SEO) your own or a client's sites. Indexing the network (gathering the collection) is carried out by a specially designed program called a search robot (bot).

The robot receives an initial list of addresses that it will have to visit, copy the contents of these pages and give this content to the algorithm for further processing (it will convert them into reverse indexes).

The robot can walk not only along the list given to it in advance, but can also follow links from those pages and index the documents found at those links. Thus, the robot behaves in exactly the same way as an ordinary user following links.

Therefore, it turns out that with the help of a robot you can index everything that is normally available to a user surfing with a browser (search engines index documents in direct line of sight, the ones any Internet user can see).

There are a number of peculiarities associated with the indexing of documents on the web (recall that we have already discussed some of them).

The first peculiarity is that, in addition to the reverse index, which is created from the original document downloaded from the network, the search engine also saves a copy of it; in other words, search engines also store the direct index. Why is this needed? I already mentioned a little earlier that it is necessary for compiling different snippets depending on the query entered.

How many pages of one site Yandex shows in the search results and indexes

Note this feature of how Yandex works: for a given query, only one document from each site is present in the results. Until recently, two pages from the same resource could not occupy different positions in the SERP.

This was one of the fundamental rules of Yandex. Even if there are a hundred pages relevant to a given query on one site, only one (the most relevant) will be present in the search results.

Yandex wants the user to receive varied information, rather than scrolling through several pages of results filled with pages of the same site that, for one reason or another, did not interest the user.

However, I hasten to correct myself: as I was finishing this article, I learned the news that Yandex has begun to allow a second document from the same resource to be shown in the results, as an exception, if that page turns out to be "very good and appropriate" (in other words, highly relevant to the query).

Remarkably, these additional results from the same site are also numbered, so some resources occupying lower positions will drop out of the top because of this. Here is an example of the new Yandex results:

Search engines strive to index all websites evenly, but this is often not easy because of the vastly different numbers of pages on them (one site has ten, another has ten million). What can be done in this case?

Yandex gets out of this situation by limiting the number of documents it will put into the index from a single site.

For projects with a second-level domain name, such as this site, the maximum number of pages that the "mirror of the Runet" (Yandex) will index ranges from one hundred to one hundred and fifty thousand (the specific number depends on its attitude toward the project).

For resources with a third-level domain name - from ten to thirty thousand pages (documents).

If you have a site on a second-level domain and you need to get, for example, a million web pages indexed, then the only way out of this situation is to create many subdomains.

Subdomains of a second-level domain might look like this: JOOMLA.site. The number of subdomains of one second-level domain that Yandex will index is somewhere just over 200 (sometimes, it seems, up to a thousand), so in this simple way you can push several million web pages into the index of the "mirror of the Runet".

How Yandex treats sites in non-Russian domain zones

Because until recently Yandex searched only the Russian-language part of the Internet, it indexed mainly Russian-language projects.

Therefore, if you create a site outside the domain zones that it classifies as Russian by default (RU, SU, and UA), you should not expect quick indexing: Yandex will most likely discover it no sooner than in a month. Subsequent indexing, however, will occur with the same frequency as in the Russian-language domain zones.

That is, the domain zone affects only the time that passes before indexing begins, but it does not affect the indexing frequency afterwards. By the way, what does this frequency depend on?

The logic search engines use for re-indexing pages boils down to the following:

  1. having found and indexed a new page, the robot visits it again the next day;
  2. comparing the content with what it was the day before and finding no differences, the robot will return only after three days;
  3. if nothing has changed by then either, it will come back in a week, and so on.

Thus, over time, the frequency with which the robot visits a page comes to match, or at least be comparable to, the frequency of its updates. The robot's revisit interval can be measured in minutes for some sites and in years for others.
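Here is a hedged sketch of such a revisit schedule: if a page has not changed, the interval until the next visit grows; if it has changed, the interval shrinks. The one-day start, the doubling rule, and the caps are my own illustrative assumptions, not Yandex's actual schedule.

```python
# Toy re-crawl scheduler: the revisit interval adapts to how often the page changes (sketch only).

def next_interval(previous_days, page_changed):
    """Shrink the interval for pages that change, stretch it for pages that stay the same."""
    if page_changed:
        return max(1, previous_days // 2)      # changing pages get visited more often
    return min(previous_days * 2, 365)         # stable pages are visited ever more rarely

interval = 1   # the robot returns the day after the first indexing
for changed in [False, False, False, True, False]:
    interval = next_interval(interval, changed)
    print("come back in", interval, "day(s)")
```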

That is how smart search engines are, drawing up an individual visiting schedule for different pages of different resources. It is possible, however, to force a search engine to re-index a page at our request, even if nothing on it has changed, but more on that in another article.

We will continue studying the principles of how search works in the next article, where we will look at the problems search engines face and consider a number of related nuances, as well as much more that, one way or another, helps with promotion.

Good luck to you! See you soon on the pages of this blog.
