vaes9

Apostrophes and Google Don’t Mix

8:23 pm PHT

If you’ve ever tried phrase searching in Google then try this query:

site:alistapart.com "ever trust the 8-bit representations"

Got one result? It should lead you to an A List Apart article entitled “The Trouble With EM ’n EN.” Notice that in the results page, the highlighted phrase is preceded by the word “don’t.” So if we add that word to the phrase in the search query, we should get the same result, right?

site:alistapart.com "Don't ever trust the 8-bit representations"

Oops. The page is no longer to be found! What happened?

The problem is that A List Apart uses what we call typographical apostrophes, which are advocated by that example article itself. These are also called smart apostrophes or curly apostrophes. They are a different character from the straight apostrophe, which you can type using the key immediately to the left of the Enter key on a US-layout keyboard (or two keys to the left of Enter, depending on where the backslash key is located).

Many sites, like this very blog, convert straight apostrophes and straight quotes into curly apostrophes and curly quotes to look more professional. Blogs, especially those powered with the SmartyPants plug-in, have this feature too.
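To make the conversion concrete, here is a minimal Python sketch of the apostrophe part of that feature. It is not SmartyPants’s actual algorithm; the function name `curl_apostrophes` and the rule it applies (only a straight apostrophe sitting between two word characters gets replaced) are assumptions chosen for illustration.

```python
import re

STRAIGHT = "'"       # U+0027 APOSTROPHE
CURLY = "\u2019"     # U+2019 RIGHT SINGLE QUOTATION MARK

def curl_apostrophes(text):
    """Replace a straight apostrophe with U+2019 when it sits between
    two word characters, as in "don't" or "Mary's".

    This is a deliberately naive rule: it leaves opening and closing
    single quotes alone and ignores cases such as "'n" or "dogs'".
    """
    return re.sub(r"(?<=\w)'(?=\w)", CURLY, text)

sample = "Don't ever trust Mary's 'quoted' text"
converted = curl_apostrophes(sample)

# Show the code point of every apostrophe-like character in the result.
for ch in converted:
    if ch in (STRAIGHT, CURLY):
        print(f"U+{ord(ch):04X}")
```

The two converted apostrophes look almost the same as before on screen, but the loop shows that they are now a different code point from the untouched single quotes around “quoted.”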

While some ASCII junkies protest this “frivolity,” the Unicode standard itself suggests that the character “U+2019 RIGHT SINGLE QUOTATION MARK” be used to represent apostrophes, especially those that mark contractions and possessive forms in English. The relevant passages, taken from Unicode Technical Report #8, are shown below:

For historical reasons, U+0027 is a particularly overloaded character. In ASCII it is used to represent a punctuation mark (such as right single quotation mark, left single quotation mark, apostrophe punctuation, vertical line, or prime)….

In the case of an apostrophe… U+2019 RIGHT SINGLE QUOTATION MARK is preferred where the character is to represent a punctuation mark, as in “We’ve been here before.” In the latter case, U+2019 is also referred to as a punctuation apostrophe.

So what has this got to do with Google? Apparently, Google does not recognize this character as equivalent to a plain apostrophe, especially when it’s used in the middle of words, as in “don’t” or “Mary’s.” If you do a phrase search in Google and the phrase includes words with apostrophes, web pages that contain the same phrase but use curly apostrophes won’t be returned in the results.
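The mismatch is easy to reproduce with plain string matching. The Python sketch below is only an illustration of the symptom and of one possible remedy (folding curly single quotes to U+0027 on both sides before comparing); it makes no claim about how Google actually indexes or matches text, and the names `normalize` and `phrase_match` are hypothetical.

```python
# Map both curly single quotation marks to the straight apostrophe.
APOSTROPHE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'"})

def normalize(text):
    """Fold curly single quotes to U+0027 and lowercase the text."""
    return text.translate(APOSTROPHE_MAP).lower()

def phrase_match(phrase, page, fold=True):
    """Return True if the phrase occurs in the page text.

    With fold=False the comparison is case-insensitive but treats
    U+0027 and U+2019 as different characters.
    """
    if fold:
        return normalize(phrase) in normalize(page)
    return phrase.lower() in page.lower()

# Page text as a site using typographical apostrophes would serve it.
page = "Don\u2019t ever trust the 8-bit representations"

# Queries as typed on a keyboard, with a straight apostrophe.
short_query = "ever trust the 8-bit representations"
long_query = "Don't ever trust the 8-bit representations"

print(phrase_match(short_query, page, fold=False))
print(phrase_match(long_query, page, fold=False))
print(phrase_match(long_query, page, fold=True))
```

Without folding, the shorter query matches and the longer one does not, which mirrors the two searches at the top of this entry. With folding, both forms of the apostrophe compare as equal and the longer phrase is found.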

This is bad. Web sites that try to add a touch of professionalism and follow what the standards suggest may get penalized in terms of traffic from search results.

There is unfortunately very, very little literature about this on the Web. The only reference I can find to this problem is the 7th post on this discussion thread on a search engine meta site.


Comments

Comment times are in Philippine time (+0800).

1

On 10:17 p.m., 20 Sep 2005, Eric Baillargeon wrote:

You got the same problem with the majority of Desktop Search. Copernic or Google Desktop will not show any document with a single apostrophe query, curly or not!
