To be fast and efficient (i.e. to give relevant results), the search engine uses a system of content indexing. Indexing content means analysing all the textual content in SPIP, extracting all the words, and storing the location of each word.
Similar to a book index showing all the important words contained in the book, cross-referenced with the page numbers where they are found, the search engine associates with each word used on the site the number of the article, news item, etc. where the word is found.
Then, when you use the search engine to carry out a search, SPIP doesn’t need to read through the full text of the site, but simply checks the index to find out where these words are found. Thus, a lot of time is saved (and the bigger the site, the more time is saved).
Here is how it works: take some text (long or short), extract all the words, record each of these words in a database with a reference to where the word is found.
For example, our site has three articles, whose (very short) texts are:
- article 1: "The very little mouse died of cold and hunger."
- article 2: "A very large mouse returned to the house."
- article 3: "A house resists cold."
We’re going to extract the words from each article and record, for each word, which articles it appears in (we only take words with more than three letters; we will explain why later):
- little: 1
- mouse: 1, 2
- died: 1
- cold: 1, 3
- hunger: 1
- very: 1, 2
- large: 2
- returned: 2
- house: 2, 3
- resists: 3
And so on, considering that our site is certainly much larger, and the articles much longer.
If we search for the word mouse:
- the solution without indexing would consist of re-reading all the articles to find the word mouse; on a large site, even for a computer, this would take a long time
- since we have an index, we only need to look up the entry for mouse: we know immediately that it’s found in articles 1 and 2.
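The example above can be reproduced with a minimal Python sketch. It is only an illustration of the principle, not SPIP's own code: the function name `build_index` and the simple letter-based word splitting are assumptions made for this example.

```python
import re
from collections import defaultdict

# The three sample articles from the example above.
articles = {
    1: "The very little mouse died of cold and hunger.",
    2: "A very large mouse returned to the house.",
    3: "A house resists cold.",
}

def build_index(texts, min_length=4):
    """Map each word of at least min_length letters to the articles containing it."""
    index = defaultdict(set)
    for number, text in texts.items():
        # Illustrative word splitting: runs of letters, lowercased.
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) >= min_length:
                index[word].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}

index = build_index(articles)

# Searching is now a single lookup instead of re-reading every article.
print(index["mouse"])  # [1, 2]
print(index["cold"])   # [1, 3]
```

The index is built once, when the content changes; each search afterwards only reads the entry for the word requested.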
Indexing is complemented by another element: weighting, which makes the search engine results more relevant. For example, if a word appears in the title of one article and only in the body text of another, we assume that someone searching for this word should see the first article ahead of the second. Similarly, if a word appears 25 times in one article and only twice in another, the article with more occurrences should be returned first.
Simple indexing is not sufficient. If you search for mouse, you can find the articles where the word appears, but then you can’t further sort the articles themselves (according to whether the word mouse appears once or 20 times, or whether it’s found in the title or simply in the body text...).
So, SPIP must calculate the weighting for each word in each article. This means giving points to a word according to where it’s found and the number of occurrences:
|in the title||8 points|
|in the sub title||5 points|
|in the top title||5 points|
|in the brief description||4 points|
|in the deck||3 points|
|in the text||1 point|
|in the postscript||1 point|
If there are several occurrences of the word, their points are added up.
For example, if in an article the word mouse appears:
- once in the title: 8 points
- twice in the brief description: 2 x 4 = 8 points
- six times in the text: 6 x 1 = 6 points
- total: 8 + 8 + 6 = 22 points.
In the index the word mouse is entered as:
- mouse, in article number X, 22 points;
- mouse, in article number Y, 15 points;
If you search for the word mouse, thanks to the index, you can find that it appears in articles X and Y, and we can further classify between them: 22 points in X, 15 points in Y, to work out that article X is more relevant to the search.
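The weighting calculation can be sketched in a few lines of Python. The point values come from the table above; the field names and the `word_score` function are hypothetical names chosen for this illustration, not SPIP identifiers.

```python
# Points per occurrence, taken from the table above.
# The dictionary keys are illustrative labels, not SPIP field names.
FIELD_POINTS = {
    "title": 8,
    "subtitle": 5,
    "top_title": 5,
    "brief_description": 4,
    "deck": 3,
    "text": 1,
    "postscript": 1,
}

def word_score(occurrences):
    """Add up the points of a word; occurrences maps a field to a count."""
    return sum(FIELD_POINTS[field] * count for field, count in occurrences.items())

# The word "mouse" in article X: once in the title, twice in the
# brief description, six times in the text.
score_x = word_score({"title": 1, "brief_description": 2, "text": 6})
print(score_x)  # 22

# With the scores stored in the index, results can be sorted by relevance.
entries = {"X": score_x, "Y": 15}
ranking = sorted(entries, key=entries.get, reverse=True)
print(ranking)  # ['X', 'Y']
```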
Keywords: many users mistake keywords for indexing. Keywords have, by their nature, no link with indexing: when SPIP performs a search, it doesn’t search in the keywords associated with articles (this would be very limited); it searches in the index created from the exact text of the articles. So keywords are not involved in the basic principle of indexing.
However, keywords are used when creating the index of the words in articles. If a keyword is associated with an article then its contents are indexed as if they were part of the article itself, with a strong weighting (12 points for the name of the keyword, 3 points for its description). Thus, if our article Y (15 points taking into account its own contents) is associated with the keyword mouse (indicating that this is the subject of the article) then we need to add 12 points to the indexing of the article for the keyword mouse: total 27 points. In our search for mouse article Y now steps ahead of article X.
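As a rough sketch, the keyword bonus simply adds to the points an article already has for a word. The 12 and 3 point values are those given above; the function and parameter names are assumptions made for this example.

```python
# Points given above for a word found in an associated keyword.
KEYWORD_NAME_POINTS = 12
KEYWORD_DESCRIPTION_POINTS = 3

def add_keyword_points(article_points, in_name=0, in_description=0):
    """Add keyword points to the points an article earned from its own content."""
    return (article_points
            + KEYWORD_NAME_POINTS * in_name
            + KEYWORD_DESCRIPTION_POINTS * in_description)

# Article Y has 15 points for "mouse" from its own content and is
# associated with a keyword named "mouse".
score_y = add_keyword_points(15, in_name=1)
print(score_y)  # 27
```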
Finally we note that all of SPIP’s elements are subject to indexing: articles, news items, sections, authors, keywords, referenced sites (if the site is syndicated, the titles of the articles drawn from this site also enter into the indexing).
Where is it stored?
The indexing data is stored directly in the database. This is a bit technical and I will not spend a lot of time on this subject.
Suffice it to say, there are many indexes (lists of words), each corresponding to a type of content (an index for articles, an index for sections, one for news items…). Furthermore, contrary to the explanation above, where the entries in the index are words, in SPIP the entries in the index are numbers calculated from these words ("hashes"); another table in the database stores the correspondence between the actual words and these numbers; this method accelerates searches across the index (it’s faster for the application to search across numbers than words).
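To give an idea of the principle, here is a minimal Python sketch with two in-memory dictionaries standing in for the database tables. Everything in it is an assumption made for illustration: SPIP's actual hash function and table layout are not described here, and CRC32 is used only because it is readily available.

```python
import zlib

def word_hash(word):
    """Hypothetical hash: CRC32 of the word. SPIP's real function may differ."""
    return zlib.crc32(word.encode("utf-8"))

dictionary = {}     # number -> word (the correspondence table)
article_index = {}  # number -> {article: points} (the index for articles)

def index_word(word, article, points):
    """Record a word under its number rather than under the word itself."""
    number = word_hash(word)
    dictionary[number] = word
    article_index.setdefault(number, {})[article] = points

def lookup(word):
    """Find the articles for a word by searching on its number."""
    return article_index.get(word_hash(word), {})

index_word("mouse", "X", 22)
index_word("mouse", "Y", 15)
print(lookup("mouse"))  # {'X': 22, 'Y': 15}
```

The index itself only ever compares numbers; the words are needed only in the correspondence table.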
The size of indexes is very important. On the uZine site, for example, the part of the database dedicated to storing articles is 9.7MB. The part which stores the index for these articles is 10.5MB. And the part which stores the correspondence between words and their multiple translations is 4.1MB. So, for 9.7MB of articles, the indexing of the articles itself represents 14.6MB (10.5MB + 4.1MB). The space required for the articles and the search together is more than double the size required for the articles alone. So this is one of the reasons you may have for preferring to disable the search engine, if you have restrictive hosting limitations.
Which words are indexed?
We’ve seen that all words of all elements of the site are extracted, then analysed (for weighting). However, SPIP does not take into account all the words.
- HTML code is excluded from indexing. This is necessary to obtain "clean" searches.
- Words of fewer than 4 letters are not taken into account (in reality, fewer than 3 letters, but words of 3 letters are not for the moment included in the searches). This point raises many questions from users…
The problem is to obtain results as relevant as possible. We have to highlight the really important words in the context of the search. For example, if we search for the mouse, the important word is mouse and not the…
Going back to our first example (with three articles each consisting of a sentence). If we had indexed all the words, we would have:
- the: 1, 2
- of: 1
- to: 2
Searching for the words the cold is very dangerous, we would find the entries:
- the: 1, 2
- cold: 1, 3
- is: nothing
- very: 1, 2
- dangerous: nothing
Adding the results of these words for each article (in reality, the weighting of each article, but we consider that for our example each word has a weighting of only one point), we obtain:
- article 1: 3 words
- article 2: 2 words
- article 3: 1 word
The classification would put article 1 at the top then article 2 and then article 3. However, article 2 does not mention cold, as article 3 does. Because we took into account words unimportant to the meaning of the search (the, very) the sorting is incorrect.
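The counts above can be checked with a short Python sketch that compares a search counting every word with one that ignores words of fewer than four letters. The `matches` function and its `min_length` parameter are hypothetical names used only for this illustration.

```python
import re

articles = {
    1: "The very little mouse died of cold and hunger.",
    2: "A very large mouse returned to the house.",
    3: "A house resists cold.",
}

def matches(query, texts, min_length):
    """Count, per article, the query words of at least min_length letters it contains."""
    query_words = [w for w in re.findall(r"[a-z]+", query.lower())
                   if len(w) >= min_length]
    result = {}
    for number, text in texts.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        result[number] = sum(1 for w in query_words if w in words)
    return result

query = "the cold is very dangerous"
all_words = matches(query, articles, min_length=1)
long_words = matches(query, articles, min_length=4)
print(all_words)   # {1: 3, 2: 2, 3: 1}
print(long_words)  # {1: 2, 2: 1, 3: 1}
```

Ignoring short words removes the, so article 2 no longer comes out ahead of article 3. The word very has four letters and still counts, which shows that a rule based on length alone is only an approximation.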
Indexing needs to disregard words which are not important to the meaning of the search. The best method would consist of providing a list of words not to index; however, creating such an exclusion dictionary (and one for each language) would be an enormous task… So, more simply, in SPIP we chose to consider words of three letters or fewer as "unimportant"; obviously, there are casualties, as words like cat, sea, hat… are no longer taken into account; it’s a compromise judged on the efficiency of searches (which are generally of good quality).
NB. Since version 1.6 of SPIP, acronyms of two letters or more, including those which contain numbers (G8, VAT, PHP, AOL...) are indexed. An acronym is considered as a word which contains no lowercase letters.
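The rules of this section can be combined in a small Python sketch: strip the HTML, keep acronyms of two characters or more, and otherwise keep only words of four letters or more. This is an approximation written for illustration; the regular expressions and the `indexable_words` function are assumptions, not SPIP's own code.

```python
import re

def indexable_words(html_text):
    """Return the words that the rules above would keep for indexing."""
    # Crude HTML removal, enough for an illustration.
    text = re.sub(r"<[^>]+>", " ", html_text)
    kept = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        # An acronym is a word of two characters or more with no lowercase letters.
        is_acronym = len(word) >= 2 and not any(c.islower() for c in word)
        if is_acronym:
            kept.append(word)
        elif len(word) >= 4:
            kept.append(word.lower())
    return kept

sample = "<p>The G8 and the <b>PHP</b> cat sat near the house</p>"
print(indexable_words(sample))  # ['G8', 'PHP', 'near', 'house']
```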
When does indexing take place?
Indexing takes place at three different moments:
- when you validate an article, it is immediately indexed;
- when you modify an article that is already published, it is re-indexed;
- when, during a visit to the public site, an element accessible to the public is found to be not already indexed. This may be the case for example, if the indexing data has been erased via the private space, or if you reinstate a backup of your site – the indexes are not saved –, or if you activate the search engine after having already published articles. In these cases, the index is built as a background task.
Remember however that the operation of indexing is relatively heavy: it demands a number of calculations (not complex calculations, but nevertheless performed on all the words of an article), and makes a large number of calls to the database. With a slow host (only a really very slow one!) it is perhaps preferable to disable the search engine.
Remember also that if you activate the search engine after having already published some articles, these are not immediately indexed: visits to the public site trigger their indexing. For a large site this could take some time.
Since all documents are being indexed, you can now perform searches.
If you search for a single word...
SPIP will look at the index, and find the appropriate entry for this word. For the word mouse, we would have found the articles X and Y. SPIP will then go and look up the number of points attributed to this word for each article (22 points in X, and 27 points in Y). We can then classify the results: article Y, then article X.
If you search for multiple words…
SPIP does not allow boolean searches such as "AND" or "OR"; it doesn’t work in this way.
When you search for multiple words, SPIP will perform the search task for each word, look up the points for each article (or news item, or section, etc.) and add them up.
For example, searching for the words mouse, large, house we obtain the following results for each word:
- mouse: article X (22 points), article Y (27 points)
- large: article X (12 points), article Y (2 points), article Z (5 points)
- house: article Y (3 points), article Z (17 points)
SPIP adds up all these points for each article:
- article X: 22 + 12 = 34 points
- article Y: 27 + 2 + 3 = 32 points
- article Z: 5 + 17 = 22 points
The sorting of the articles, for the search (mouse, large, house) is: article X, then article Y, then article Z. (In the site’s templates you can output the points of each result using the tag #POINTS - refer to the article about The search loops and tags.)
This isn’t a search like "AND" or "OR", it is an addition of points. And its use is quite effective (you find what you’re looking for, which is the goal of the search engine...).
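The addition of points for a multiple-word search can be sketched as follows in Python. The index contents are the figures from the example above; the `search` function is a hypothetical name for this illustration and does not reflect how SPIP queries its database.

```python
# Points per word and per article, from the example above.
index = {
    "mouse": {"X": 22, "Y": 27},
    "large": {"X": 12, "Y": 2, "Z": 5},
    "house": {"Y": 3, "Z": 17},
}

def search(words, index):
    """Add up the points of every searched word for each article, best first."""
    totals = {}
    for word in words:
        for article, points in index.get(word, {}).items():
            totals[article] = totals.get(article, 0) + points
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

results = search(["mouse", "large", "house"], index)
print(results)  # [('X', 34), ('Y', 32), ('Z', 22)]
```

An article does not need to contain every word to appear in the results: article X has no entry for house and still comes first, because its total is the highest.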