There has been a lot written about latent semantic indexing and what it means. In fact, only Google know what Google are referring to when they use the term and statisticians probably have a different version.
Latent Semantic Analysis is a mathematical/statistical means of assessing the use and meaning of language. Words and word strings can be used to determine the context in which they are used. LSI can be regarded as a misnomer, in that the term Latent Semantic 'Indexing' is basically meaningless. Indexing can be carried out through the use of Latent Semantic 'Analysis' as a tool to enable a web page to be indexed for specific keywords, but nothing can be indexed 'latent semantically'.
However, as usual I am likely being pedantic, so I will use the common term LSI rather than the correct one of LSA. Pedantry in language and the way it is used, however, is probably the only way that language will continue to be used in a way that is not ambiguous and left to interpretation, and that is precisely the reason for Google adopting this algorithm to improve its service to its users.
Don't forget that a search engine's customers are not you and I, who try to make money from it, but those who use it to find information. The more accurately that information meets the needs of the user, the more likely that user is to use that search engine next time.
Latent semantic indexing does not use any human dictionary: its input is character strings, which it resolves into words, sentences and paragraphs. This is a very simplistic description of a complex statistical and mathematical computation, but the upshot is that it cannot be fooled by the excessive use of keywords as the sole content of a webpage.
To put it even more simply, LSI allows searchers to find information that contains no words used in their queries. Hence, if you are looking for information on calligraphy, it will be provided if you use words such as writing, penmanship and script in your search term. Before the adoption of LSI, such searches would provide pages with information on “calligraphy” only if that word was also contained in the search term or ‘keyword’.
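The calligraphy example can be illustrated with a few lines of Python. This is only a rough sketch, and it is not true LSA (which applies a statistical decomposition to a large term-by-document matrix), nor is it anything Google has published. It swaps in a much cruder stand-in, simple co-occurrence query expansion: words that appear alongside the search term in one document are used to find other documents. The five-document corpus, the stopword list and the function names are all invented for illustration.

```python
# Hypothetical mini-corpus, invented purely for illustration.
DOCUMENTS = [
    "calligraphy is the art of beautiful writing with pen and ink",
    "good penmanship means neat writing with a pen",
    "script styles in calligraphy use a broad pen",
    "a spider spins a web to catch a fly",
    "the fly was caught in the web",
]

# A tiny, hand-picked stopword list (an assumption, not a standard one).
STOPWORDS = {"a", "the", "is", "of", "with", "and", "in", "to", "was"}


def tokenize(text):
    """Lower-case the text, split on whitespace and drop stopwords."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


def expand_query(query, documents):
    """Add every word that shares a document with any query word."""
    query_terms = set(tokenize(query))
    expanded = set(query_terms)
    for doc in documents:
        terms = set(tokenize(doc))
        if terms & query_terms:
            expanded |= terms
    return expanded


def search(query, documents):
    """Rank documents by how many expanded query words they contain."""
    expanded = expand_query(query, documents)
    scored = []
    for doc in documents:
        overlap = len(expanded & set(tokenize(doc)))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


if __name__ == "__main__":
    for hit in search("penmanship", DOCUMENTS):
        print(hit)
```

Searching for 'penmanship' returns the calligraphy pages even though they never use that word, because they share 'writing' and 'pen' with the one page that does. The spider pages share nothing and are left out.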
It has been adopted by search engines, specifically Google, in order to dilute the requirement for the presence of specific search terms within a web page or article in favor of content that the search term used implies would meet the needs of the person carrying out the search. In other words, LSI is intended to provide a better service to Google’s customers. It renders irrelevant any content which contains little but specific keywords, and punishes word repetition to the extent that even 2% keyword density could, in some cases, be regarded as keyword ‘stuffing’.
For this reason, the use of ‘keywords’ as such is now of less significance than prior to the adoption of the LSI concept by search engines. It is no longer necessary for content to contain between 1% and 3% ‘keyword density’ as was, and still is, recommended. In fact, no keywords are necessary at all, though it is still useful to set the theme of your page with a ‘keyword’ in the title, introduce one at the beginning of the text and finish with one near the end. The remainder of the text should be rich in content which is relevant to the theme of the page, or what was once called the ‘keyword’, just as ‘writing’, ‘penmanship’ and ‘script’ are relevant to the topic of ‘calligraphy’.
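For anyone unsure what a figure such as 2% means in practice, keyword density is simply the number of times the keyword appears divided by the total number of words, expressed as a percentage. A minimal sketch follows, assuming a single-word keyword and a very simple idea of what counts as a word. The function name keyword_density is my own, and no search engine publishes the exact formula it uses.

```python
import re


def keyword_density(text, keyword):
    """Return the percentage of words in text that equal keyword.

    Assumes a single-word keyword; words are runs of letters,
    digits or apostrophes, compared case-insensitively.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word == keyword.lower())
    return 100.0 * hits / len(words)


if __name__ == "__main__":
    sample = "A history of locks and the keys that open each lock"
    print(keyword_density(sample, "lock"))
```

Note that this counts exact matches only, so 'locks' is not counted as 'lock'. On a 500-word article, 2% density means the keyword appearing just ten times.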
If the search engines take this to a logical conclusion, and there is no reason why they should not, many businesses that rely on keyword research and suchlike will have to adapt to survive. Wordtracker will lose popularity and the old thesaurus will once more become king.
Software which uses synonym replacement will have to become more sophisticated, and ensure not only that synonyms are not repeated but also that they are true synonyms with grammatical and semantic relevance to the context in which they are used. This has been sadly lacking in all such software that I have purchased for my review sites.
Let's have a look at two possible practical examples of LSI at work, and these will likely make the whole concept a good deal clearer to you. Take the partial phrase "When a spider crawls the web, it is looking for. . . "
What does this mean to you? Does it mean that an arachnid is seeking flies? Or does it mean that a search engine is crawling the World Wide Web looking for pages that meet certain criteria? You don't know the answer to that until you read the rest of the text. That is LSI at work in your brain. Now, a computer might be able to compute, but as yet it has no brain and can base its results only on predefined rules.
These rules make up the algorithm, programmed according to the principles of LSA: once it sees the word fly, it associates the spider with an arachnid. If it sees the word page or site, it will associate it with a website. The rest of the text on the page will enable the algorithm to index that page correctly, and somebody seeking information on how spiders detect flies on their webs won't end up with page after page about internet marketing!
Another example: A web page is titled "A History of Locks". What does that mean? It likely doesn't mean the history of locks of hair, since that is senseless, but how does a piece of software know that? To an algorithm that doesn't think like humans, it could be exactly that. It could equally be a history of canal locks or security locks, each of these being highly plausible. So how does the spider, or algorithm, determine the subject so that the user of the search engine is not given useless information?
LSA! It is programmed to know that the character strings 'barge', 'longboat', 'canal' and so on will relate to canal locks, and that 'keys', 'keyhole', 'security', etc. will relate to security locks. You have no need to use the word 'lock' over and over again as a keyword: the semantics of the rest of the vocabulary will provide the algorithm with the answer. The word 'semantics' refers to the meaning of words, in this case the meaning of the word 'lock'.
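To make the 'locks' example concrete, here is a toy sketch in Python. It is a deliberate simplification: the cue words are typed in by hand from the lists above, whereas real LSA derives such associations statistically from a large body of text, and nothing here should be taken as how any search engine actually does it. The names CUE_WORDS and guess_sense are hypothetical.

```python
import re

# Hand-written cue words taken from the examples in the text.
# A real system would learn these associations, not hard-code them.
CUE_WORDS = {
    "canal lock": {"barge", "longboat", "canal"},
    "security lock": {"keys", "keyhole", "security"},
}


def guess_sense(text):
    """Guess which kind of 'lock' a page is about from its other words.

    Returns the sense with the most cue words present, or None when
    no cue word appears or the two senses tie.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(cues & words) for sense, cues in CUE_WORDS.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, top), (_, runner_up) = ranked[0], ranked[1]
    if top == 0 or top == runner_up:
        return None
    return best


if __name__ == "__main__":
    page = "A History of Locks: the barge waited while the canal filled"
    print(guess_sense(page))
```

The title alone gives the function nothing to go on, so it returns None. It is the surrounding vocabulary that settles the matter, which is the whole point.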
Google, and other major search engines, are using this concept to determine what website content is really about: what it is really saying. They are catching out pages written specifically to get listed for individual keywords, but that have little useful content other than meaningless repetitions of the keyword.
Many webpages which, until recently, have been highly listed by Google and other search engines, have disappeared overnight after being subject to scrutiny such as they have never had before and have been found wanting. If the content of your webpages is relevant to the topic or theme of the website, and if you can honestly say that you would find them interesting were you searching for the information they claim to provide, then your pages should be safe. Bear in mind that the search engines treat every single page separately and that whole websites are not delisted, only individual pages.
Also, bear in mind that if you link to any webpage that is considered substandard by Google, your own page might suffer. If that webpage is then dropped from the listings and subsequently deleted by the webmaster, your link will become a broken link, which search engines detest. Therefore, I advise you to make a regular check of your links to make sure they are all live. There is software available to help you do this if you have too many to check manually.
It is important to understand that LSI is not a technique, as such. It is a concept born of complex statistical analysis, and the idea that latent semantic indexing can be used to improve a webpage is blatant nonsense. LSI cannot be used as such, and SEO sites that claim to be able to write LSI-friendly websites are doing this through ignorance. There is no such thing as a technique that can make a webpage LSI-friendly or compliant, as some companies advertising on the internet claim.
The term ‘keyword’ or ‘key phrase’ might become history while search engines use a 'semantic' concept for assessing webpage content. What will remain true is that Google will continue to work to satisfy Google's customers and ensure, as far as it can, that websites generated purely for profit, and not to provide information, will not see the light of day on its result pages, and that the current form of content analysis will be refined to that end.
The day will come when you will type 'locks' into your search engine and the result will provide exactly the type of 'locks' that you are thinking of. Or perhaps not. In the future it will likely only be required that you think of it!