Intelligence in the search engine

How does intelligence get into a search engine?

Let’s assume that you are building a search engine. You do not want to rely on domain experts, who are expensive and not always infallible; instead, you want to build the search engine solely with enough data servers (the hardware for the corpus) and some ingenious software. In principle, you will use a neural network with a corpus. How do you inject intelligence into your system?

Trick 1: Let the customers train the corpus

As with the tank AI of previous blog posts, a search engine depends on categorisations. This time they are provided by customers, who allocate input texts (search strings) to web addresses that might be of interest for their searches. To find the relevant addresses, your system again relies on a learning corpus, which this time consists of your previous customers’ search inputs. The web addresses that previous customers clicked from among those offered to them count as positive hits in the corpus. For new queries, including those from other customers, you simply show the addresses that have received the most clicks to date. They can’t be all that bad, after all, and the system is refined with every query and subsequent click. And it still holds that the bigger the corpus, the more precise the system.
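The click-counting idea can be sketched in a few lines of Python. This is only a minimal illustration and not how any real search engine is built: the class name ClickCorpus, its methods and the example addresses are made up for this post, and a real system would of course use far more signals than raw click counts.

```python
from collections import Counter, defaultdict


class ClickCorpus:
    """Hypothetical learning corpus: click counts per search string."""

    def __init__(self):
        # search string -> Counter of clicked web addresses
        self._clicks = defaultdict(Counter)

    @staticmethod
    def _normalise(query):
        # Assumption: queries that differ only in case or spacing are the same.
        return " ".join(query.lower().split())

    def record_click(self, query, url):
        """A customer clicked `url` after searching for `query`."""
        self._clicks[self._normalise(query)][url] += 1

    def suggest(self, query, k=3):
        """Return the k addresses with the most clicks so far for this query."""
        counts = self._clicks.get(self._normalise(query))
        if not counts:
            return []  # nobody has searched for this before
        return [url for url, _ in counts.most_common(k)]


corpus = ClickCorpus()
for url in ["a.example", "b.example", "a.example", "c.example", "a.example", "c.example"]:
    corpus.record_click("cheap flights", url)

print(corpus.suggest("Cheap  Flights", k=2))
```

Every further click changes the counts, so the suggestions shift with use, which is all that "learning" means in this sketch.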

Again, the categorisations originate outside the system: they are provided by people who assessed the selection offered by the search engine and placed their clicks according to their preferences. They did so

  • with their human intelligence and
  • in line with their individual interests.

The second point is particularly interesting. We might have a closer look at this later.

Trick 2: Assess the customers at the same time

Not every categorisation by every customer is equally relevant. As a search engine operator, you can optimise in two directions:

  • Assess the assessors:
    You know all your customers’ inputs, so you can easily find out how reliable their categorisations are, i.e. the web addresses they clicked in connection with their search strings. Not all customers are equally proficient in this respect. The more other customers click the same web address for the same search string, the more reliable the categorisation will be for future queries. You can use this information to weight your customers: the customer whose categorisations have been most reliable so far, i.e. the one who most often chose what the others also chose, is given the most weight. A customer whose choices were shared by fewer others is regarded as less reliable. This weighting increases the probability that future search results will rank the websites of interest to most customers higher.
  • Assess the searchers:
    Not every search engine user has the same interests. You are able to take this into consideration since you know all their previous inputs. You can make use of these inputs to generate a profile of this customer. This will naturally enable you to select the search results for him or her accordingly. Assessors with a profile similar to the searcher’s will weight the potential addresses similarly, too, and you will be able to personalise the search results even more in the customer’s interest.
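Both directions can be combined in a small sketch. Everything in it is an assumption made for illustration: the click log is a list of (customer, search string, web address) triples with at most one click per customer and search string, reliability is the average share of other customers who clicked the same address, a profile is simply the bag of words of a customer's past search strings, and similarity is the cosine between two such profiles. The function names and example addresses are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def reliability(log):
    """Assess the assessors: how often did the others choose what this customer chose?"""
    pair_clicks = Counter((query, url) for _, query, url in log)
    query_clicks = Counter(query for _, query, _ in log)
    shares = defaultdict(list)
    for customer, query, url in log:
        others = query_clicks[query] - 1  # clicks by other customers on this query
        if others:
            shares[customer].append((pair_clicks[(query, url)] - 1) / others)
    return {customer: sum(s) / len(s) for customer, s in shares.items()}


def profile(log, customer):
    """Assess the searchers: bag of words over all previous search strings."""
    words = Counter()
    for who, query, _ in log:
        if who == customer:
            words.update(query.lower().split())
    return words


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def personalised_ranking(log, searcher, query):
    """Each click counts as: reliability of the assessor x similarity to the searcher."""
    weights = reliability(log)
    mine = profile(log, searcher)
    scores = Counter()
    for customer, q, url in log:
        if q == query and customer != searcher:
            scores[url] += weights.get(customer, 0.0) * cosine(mine, profile(log, customer))
    return [url for url, _ in scores.most_common()]


log = [
    ("ann", "jaguar", "cars.example"), ("ann", "sports car", "cars.example"),
    ("bob", "jaguar", "cars.example"), ("bob", "sports car", "cars.example"),
    ("cat", "jaguar", "zoo.example"), ("cat", "big cats", "zoo.example"),
    ("dan", "jaguar", "zoo.example"), ("dan", "big cats", "zoo.example"),
    ("eve", "big cats", "zoo.example"), ("fay", "sports car", "cars.example"),
]

print(personalised_ranking(log, "eve", "jaguar"))
print(personalised_ranking(log, "fay", "jaguar"))
```

In this toy log, the same search string "jaguar" is ranked differently for eve, who previously searched for big cats, and for fay, who previously searched for sports cars, because only assessors with a similar profile contribute to their results.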

For you as a search engine operator, it is in any case worth generating a profile of every customer, if only to improve the quality of the search suggestions.

Consequences

  1. Search engines become more precise the more they are used.
    This applies to all corpus-based systems, i.e. to all technologies with neural networks: the larger their corpus, the higher their precision. They can be capable of amazing feats.
  2. A remarkable feedback effect can be observed in this connection: the bigger the corpus, the better the quality of the search engine, which is why it is used more often, which in turn enlarges its corpus and thus boosts its attractiveness in comparison with competitors. This effect inevitably results in such monopolies as are typical of all applications of corpus-based software.
  3. All the categorisations were primarily made by human beings. The basis of intelligence – the categorising inputs in the corpus – is still provided by human beings. In the case of search engines, these are all the individual users who in this way input their knowledge into the corpus. Which means that the intelligence in AI is not all that artificial after all.
  4. The tendency towards bubble formation is inherent in corpus-based systems: if search engines generate profiles of their customers, they can offer them better search results. In a self-referential loop, this inevitably leads to bubble formation: users with similar views are brought increasingly closer together by the search engines since in this way, these users are provided with the search results which correspond most closely to their individual interests and views. They will come across deviating views less and less often.
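The feedback effect described in point 2 can be made tangible with a toy model. It is a deliberately crude sketch under stated assumptions, not a market model: each round, a fixed number of new queries is split between competing engines in proportion to their attractiveness, and attractiveness is assumed to be the corpus size raised to a power alpha. The function name and all the numbers are made up for illustration.

```python
def simulate_market(corpus_sizes, queries_per_round, rounds, alpha=2.0):
    """Return each engine's share of the total corpus after `rounds` rounds.

    Assumption: attractiveness = corpus size ** alpha. With alpha > 1, a
    bigger corpus attracts a more than proportional share of new queries.
    """
    sizes = list(corpus_sizes)
    for _ in range(rounds):
        attractiveness = [size ** alpha for size in sizes]
        total = sum(attractiveness)
        # Every query an engine receives is added to its corpus.
        sizes = [
            size + queries_per_round * a / total
            for size, a in zip(sizes, attractiveness)
        ]
    whole = sum(sizes)
    return [size / whole for size in sizes]


# Two engines start with 60% and 40% of the combined corpus.
print(simulate_market([60, 40], queries_per_round=100, rounds=20))
print(simulate_market([60, 40], queries_per_round=100, rounds=20, alpha=1.0))
```

With alpha above 1, the engine that starts ahead keeps increasing its share of the corpus. With alpha equal to 1, the shares stay where they were, so in this toy model the drift towards a monopoly depends on the assumption that a bigger corpus makes an engine disproportionately more attractive.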

The next post will be about a further important aspect of corpus-based systems, namely the role of probability.

This is a post about artificial intelligence.


Translation: Tony Häfliger and Vivien Blandford
