Ridding SpamBots – reCaptcha Won’t Work but the Honeypot Method Will

How do we get rid of these SpamBots on our site?

Every site falls victim to SpamBots at some point. How you handle it can affect your customers, and most solutions can discourage some people from filling out your forms.

That’s where the Honeypot Technique comes in. It allows you to ignore SpamBots without forcing your users to fill out a reCaptcha or jump through other hoops to fill out your form.

This article is purely to help others implement a Honeypot Trap on their website forms.

Since implementing the Honeypot Method on all of my clients’ websites, we have blocked 99.5% of our spam (thousands of submissions). That is without using the techniques mentioned in the “Advanced” section, which will be implemented soon.

Sorry Google, No CAPTCHA or reCAPTCHA will stop the SpamBots

Google recently launched a new version of reCAPTCHA, which claims to be more robust against bots and easier on humans.

Google’s YouTube video about it is pretty convincing, but things got a little interesting when we dug deeper. The new approach, which looks like a sophisticated bot-identification algorithm, is nothing more than a use of browser cookies.

So here’s what happens when you are thrown a reCAPTCHA challenge:

  • You are asked to solve a reCAPTCHA image the first time.
  • The result of evaluating the text you entered is cached in your browser’s cookies.
  • The next time you visit the page, or any page which requires you to pass reCAPTCHA, the information from these cookies is used to identify whether you have passed the test before.

A simple test can be done here: https://wordpress.org/support/register.php.

After you solve the reCAPTCHA image for the first time, you are not required to solve an image when you visit again. But once you delete your cookies and try again, you are back to square one: you must solve the image before the form submission succeeds. Google has simply used cookies to retain information about your authenticity.

What does this mean for bots? A bot can use an OCR tool to solve the image, or have somebody solve it once, after which the bot can retain the cookies and continue scraping!

The new version of reCAPTCHA can also be bypassed by another technique, using the website’s public key (called data-sitekey). Wait, what? Yes! Let’s say a bot wants to bypass website X’s reCAPTCHA by having a user on website Y solve it, without that user knowing they are helping a bot. More technically, this is called clickjacking, or a UI redress attack. The bot uses the data-sitekey of website X and disables the Referer header on a page on Y, where the user is asked to solve the reCAPTCHA.

Once the user solves the CAPTCHA, the response (called “g-recaptcha-response”) can be used by a bot running in the background to submit a form on website X. This way, the bot tricks Google into thinking that the solved reCAPTCHA response originated from website X, while it actually came from Y, and the bot can proceed to scrape website X. This works because Google doesn’t validate the Referer header if it has been disabled by the client or is empty. A genuine user has just helped a bot scrape website X without realizing they were being used as an access card.

The Honeypot Method

By adding an invisible field to your forms that only spambots can see, you can trick them into revealing that they are spambots and not actual end-users.

HTML

<input type="checkbox" name="contact_me_through_PTC_only" value="1" style="display:none !important" tabindex="-1" autocomplete="off">

Here we have a simple checkbox that:

  • Is hidden with CSS.
  • Has an obscure but obviously fake name.
  • Is unchecked by default, so a real user’s browser submits no value for it (equivalent to 0).
  • Can’t be filled by auto-complete.
  • Can’t be navigated to via the Tab key. (See tabindex)

Server-Side

On the server side, we check whether the field was submitted with a value other than 0 and, if so, handle the request appropriately. This includes logging the attempt and all the submitted fields.

In PHP it might look something like this:

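$honeypot = FALSE;

// A real user never sees the checkbox, so any non-empty value means a bot filled it in.
// empty() also treats "0" as empty, which matches the "other than 0" rule above.
if (!empty($_REQUEST['contact_me_through_PTC_only'])) {
    $honeypot = TRUE;
    // log_spambot() stands for your own logging function; it is not built into PHP.
    log_spambot($_REQUEST);
    # treat as spambot
} else {
    # process as normal
}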

Fallback

This is where the log comes in. In the event that one of your users somehow ends up being marked as spam, your log will help you recover any lost information. It will also allow you to study any bots running on your site, should they be modified in the future to circumvent your Honeypot.
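
The PHP example above calls log_spambot() without defining it. Here is a minimal sketch of what such a logger might look like, assuming one JSON record per line in a writable file. The $logFile parameter, its default path, and the record layout are assumptions made for illustration; they are not part of the original snippet.

```php
<?php
// Hypothetical logger for honeypot hits. The default path and the record
// layout are assumptions; point $logFile somewhere outside the web root.
function log_spambot(array $fields, string $logFile = '/tmp/honeypot.log'): bool
{
    $record = [
        'time'   => gmdate('c'),                          // UTC timestamp, ISO 8601
        'ip'     => $_SERVER['REMOTE_ADDR'] ?? 'unknown', // not set when run from the CLI
        'fields' => $fields,                              // everything that was submitted
    ];

    // One JSON document per line keeps the log easy to search and to re-import.
    // JSON_INVALID_UTF8_SUBSTITUTE stops malformed bot input from breaking the encoding.
    $line = json_encode($record, JSON_INVALID_UTF8_SUBSTITUTE) . PHP_EOL;

    // FILE_APPEND adds to the existing log; LOCK_EX avoids interleaved writes.
    return file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX) !== false;
}
```

Because every submitted field is kept, a real user who was wrongly flagged can be recovered from the log by hand.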

Reporting

Many services, such as CloudFlare, allow you to report known SpamBot IPs via an API or by uploading a list. Please help make the internet a safer place by reporting all the SpamBots and spam IPs you find.

Advanced

If you really need to crack down on a more advanced SpamBot, there are some additional things you can do:

  • Hide the Honeypot field purely with JS instead of plain CSS.
  • Use realistic form input names that you don’t actually use (such as “phone” or “website”).
  • Include form validation in the Honeypot algorithm (most end users will only get 1 or 2 fields wrong; SpamBots will typically get most of the fields wrong).
  • Use a service like CloudFlare that automatically blocks known spam IPs.
  • Have form timeouts, and prevent instant posting (forms submitted less than 3 seconds after the page loads are typically spam).
  • Prevent any IP from posting more than once a second.
  • For more ideas look here: How to create a “Nuclear” Honeypot to catch form spammers
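
The validation idea from the list above could be sketched roughly as follows. The field names, the validation rules, and the 50% threshold are all assumptions chosen for illustration; tune them to your own form.

```php
<?php
// Hypothetical validation for a three-field contact form.
// Returns an array of error messages keyed by field name.
function validate_contact_form(array $post): array
{
    $errors = [];
    if (trim($post['name'] ?? '') === '') {
        $errors['name'] = 'Name is required.';
    }
    if (filter_var($post['email'] ?? '', FILTER_VALIDATE_EMAIL) === false) {
        $errors['email'] = 'Email address is invalid.';
    }
    if (strlen(trim($post['message'] ?? '')) < 10) {
        $errors['message'] = 'Message is too short.';
    }
    return $errors;
}

// Flags a submission when more than $threshold of the validated fields failed.
// The 0.5 default is an assumption, not a measured value.
function looks_like_spambot(array $errors, int $totalFields, float $threshold = 0.5): bool
{
    if ($totalFields <= 0) {
        return false; // nothing was validated, so there is nothing to judge
    }
    return count($errors) / $totalFields > $threshold;
}
```

A person who mistypes only their email address fails one field out of three and is shown the usual error messages, while a submission that fails every field can be treated like a Honeypot hit and logged.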

Build a really smart Honeypot

That may seem obvious, but here are a few tricks (details later):

  • Think Like a SpamBot
  • Assume that they are able to know what is on screen or behind other elements
  • Have multiple traps.
  • Time Trap
  • Set the Honeypot

If you want to go further, and maybe you should

1. Think like a SpamBot:

Start going through your page like a SpamBot. You can even write your own, which can waste time but is quite fun. Most SpamBots crawl through the markup looking for a <form> element. Then they look at your inputs and fill them in appropriately, which is the catch: how do they know what to fill in? They will probably look at the id, class, placeholder, and label, which brings us to our first method.

Method #1:

Mislabel the inputs in your form code. For example, give your username input the id #Form_Email so that the SpamBot fills out the form incorrectly. Also hide and mislabel your inputs’ labels; use divs instead.*
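
On the server, mislabeled inputs have to be mapped back to what they really hold. The following is a rough sketch of that step, assuming the inputs’ name attributes are mislabeled the same way as their ids. Form_Email comes from the example above; Form_Website and both helper functions are invented for illustration.

```php
<?php
// Hypothetical mapping from the misleading names in the markup to the real
// meaning of each field. Only Form_Email comes from the article's example.
const FIELD_MAP = [
    'Form_Email'   => 'username', // looks like an email field, holds the username
    'Form_Website' => 'email',    // looks like a URL field, holds the email address
];

// Translates the submitted fields into their real names.
function unscramble_fields(array $post): array
{
    $real = [];
    foreach (FIELD_MAP as $markupName => $realName) {
        $real[$realName] = $post[$markupName] ?? '';
    }
    return $real;
}

// A bot that trusted the markup puts an email address where the username belongs.
function filled_as_labelled(array $real): bool
{
    return filter_var($real['username'], FILTER_VALIDATE_EMAIL) !== false;
}
```

This check assumes usernames are never email addresses. If your site allows that, the test would flag real users and should be replaced with a different signal.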

Method #2:

You’ve probably noticed that simple form fillers ignore hidden elements, whether they are hidden by their location, by what is in front of them, or by the good old display: none;, visibility: hidden;, opacity: 0; or type=’hidden’. This gives us a powerful weapon. I discovered this by accident while testing a time trap with a basic form filler. On my site, the register form is in a dialog that opens when a user clicks a register button, so by default it is hidden.

This gave me the idea for Method #2, default: form is hidden. Basically, your form is hidden on page load but is uncovered by some mouse-based action. If you want your form to be visible on page load, add an identical decoy form above the real one in the markup. If a bot fills in and submits the decoy, block its IP for a few minutes.** For real users, simply swap the two forms when the mouse hovers over the decoy.
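
The “block its IP for a few minutes” step might be sketched like this. The form_id field used to tell the decoy from the real form, the five-minute duration, and the array used as a block list are all assumptions; a real site would keep the list in a database or cache shared between requests.

```php
<?php
// Hypothetical block list: IP address => Unix time at which the block expires.
function block_ip(array $blocked, string $ip, int $now, int $seconds = 300): array
{
    $blocked[$ip] = $now + $seconds;
    return $blocked;
}

function is_blocked(array $blocked, string $ip, int $now): bool
{
    return isset($blocked[$ip]) && $blocked[$ip] > $now;
}

// Returns [status, updated block list]. 'form_id' is an assumed hidden field
// whose value is 'decoy' in the decoy form and something else in the real one.
function handle_submission(array $post, string $ip, array $blocked, int $now): array
{
    if (is_blocked($blocked, $ip, $now)) {
        return ['rejected', $blocked];
    }
    if (($post['form_id'] ?? '') === 'decoy') {
        // Only a bot submits the decoy, so start a temporary block.
        return ['rejected', block_ip($blocked, $ip, $now)];
    }
    return ['accepted', $blocked];
}
```

Keeping the block short limits the damage when several people share one IP address.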

2. Assume that they know what your page looks like

Assuming that hiding the Honeypot with CSS is perfect is a grave mistake. There are a lot of super smart screen readers, like JAWS, that could be repurposed for spamming. That is why you need multiple lines of defense.

3. Have multiple traps

  • Time Traps: Going back to thinking like a Bot, would you want to wait on a site instead of attacking others? Method #3: Create a time trap. The best way is to print the time in a hidden input when the page loads, so that when the form is submitted you can tell how long it took. Fill in the form as fast as you can; that should be the minimum amount of time allowed for your form. Note: encrypt or sign your timestamp so bots cannot change it.
  • If you want to get really fancy, measure the WPM of the Bot’s typing. This is done on Stack Exchange (try copying and pasting, then submitting a question/answer). Also, if the rate of typing is very consistent, that is a red flag.
  • Honeypots (Method #4): Use all of the above at once for best results. Make sure to trick dumb bots as well as smart bots (don’t assume the Bot is always trying hard).
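
The time trap described above could look something like the following sketch. Instead of encrypting the timestamp, it signs it with an HMAC, which is enough to detect tampering. The secret, the token format, and the function names are assumptions for illustration; the 3-second minimum comes from the “Advanced” list.

```php
<?php
// Hypothetical secret; in practice load it from configuration, not source code.
const TIME_TRAP_SECRET = 'change-me';

// Value for the hidden input: "<timestamp>:<HMAC of the timestamp>".
function make_time_token(int $now, string $secret = TIME_TRAP_SECRET): string
{
    return $now . ':' . hash_hmac('sha256', (string) $now, $secret);
}

// Returns the seconds between page load and submit, or null if the token
// is malformed or its timestamp was changed.
function elapsed_seconds(string $token, int $now, string $secret = TIME_TRAP_SECRET): ?int
{
    $parts = explode(':', $token, 2);
    if (count($parts) !== 2 || !ctype_digit($parts[0])) {
        return null;
    }
    [$issued, $mac] = $parts;
    // hash_equals() compares in constant time to avoid leaking the signature.
    if (!hash_equals(hash_hmac('sha256', $issued, $secret), $mac)) {
        return null;
    }
    return $now - (int) $issued;
}

// True when the token is genuine and the form took at least $minSeconds to fill in.
function passes_time_trap(string $token, int $now, int $minSeconds = 3): bool
{
    $elapsed = elapsed_seconds($token, $now);
    return $elapsed !== null && $elapsed >= $minSeconds;
}
```

Print make_time_token(time()) into a hidden input when rendering the form, then call passes_time_trap($_POST['token'] ?? '', time()) on submit; the field name token is also an assumption. A bot could still replay an old valid token, so you may also want to reject tokens older than some maximum age.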

Now, in order to spam us, bots will have to have cursors, render the page, wait, and type at a variable, realistic speed. If they make a Bot like that, we’ll think up more stuff.

*People using screen readers will trigger or be confused by these defenses, and depending on your country you could get into trouble for discriminating against blind or partially sighted people. Therefore, when a user triggers the bot test, take them to a form without these traps that uses a disability-friendly CAPTCHA like reCAPTCHA.

**People often share IPs, so you can chase away valid users.

P.S. Keep using simple honeypots like the ones you already have. Some bots are just too dumb to get tricked by what we have here.

PTC Computer Solutions advises businesses on technology that might help their interests. If your business is interested in getting ahead of the competition, contact us for more on how to use technology to your advantage. It’s what we do.

David WB Parker is a principal of Parker Associates of Jacksonville, Florida, marketing consultants to the real estate industry; President of PTC Computer Solutions, IT Specialist, and an active real estate sales professional with Barclay’s Real Estate Group based in Jacksonville, FL.  He can be reached at 904-607-8763 or via email davidp@ptccomputersolutions.com.
