Crawling Images from Google

vignesh


I worked on a project where we needed to fetch images of various products based on user interest. I decided to use Google to find the images, and since we had read that Google exposes a search API, that seemed like a good approach.

Google Search API

The API is known as the ‘Google Custom Search API’. It defines a request format, lets us apply all kinds of filters, and returns the search results.
The catch is that the API provides only 100 search queries per day for free. Additional requests cost $5 per 1000 queries, with a cap of 10k queries per day.
As you might imagine, things could get costly very quickly, and since I was working on an MVP, I did not want our clients to have to invest in this.
Instead of the API, we decided to use the familiar and ever-useful technique of “Web Scraping” to get the job done.
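To put the pricing quoted above in concrete terms, here is a rough sketch of the arithmetic. The constant and method names are my own, and I am assuming that the 10k cap counts the free queries too and that billing is pro-rated per query, neither of which is confirmed here.

```ruby
# Figures quoted in this post: 100 free queries per day,
# $5 per 1,000 additional queries, capped at 10,000 queries per day.
FREE_QUERIES_PER_DAY = 100
PRICE_PER_1000_USD   = 5.0
DAILY_CAP            = 10_000

# Estimated daily cost in USD for a given number of queries.
# Assumption: billing is pro-rated per query and the cap includes
# the free queries.
def daily_cost_usd(queries)
  raise ArgumentError, 'queries must be non-negative' if queries.negative?
  raise ArgumentError, "above the daily cap of #{DAILY_CAP}" if queries > DAILY_CAP

  billable = [queries - FREE_QUERIES_PER_DAY, 0].max
  billable * PRICE_PER_1000_USD / 1000
end

puts daily_cost_usd(100)     # 0.0
puts daily_cost_usd(10_000)  # 49.5
```

Under those assumptions, running at the cap every day would come to roughly $1,485 over 30 days, which is a lot for an MVP.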

Scraping Google for images

Scraping web results from Google is a pretty easy job, but crawling image results is a bit more of a challenge. In this post I am going to explain how to scrape images from Google’s image search results using Ruby and Capybara.
Ruby has good libraries for crawling and scraping. You might know that Capybara is used for testing in Rails. We can also use it for scraping purposes.
We also need a headless browser (a browser that runs without a GUI), so that we can run it on our servers. We decided to go with PhantomJS.
In order to use PhantomJS with Capybara we need a gem called ‘poltergeist’. Poltergeist has some classes of its own and supports the Capybara API.
That’s enough of the boring theory, let’s dive into some code.

Prerequisites

You should have Ruby and PhantomJS installed on your system.
From your terminal, first install the required gems.


$ gem install capybara
$ gem install poltergeist
$ gem install dsl -v 0.1
$ gem install phantomjs

We’ve now installed all the required gems.
Next, require them at the top of the script.

require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'

We have to configure Capybara to use Poltergeist as its driver and set a few options.

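# js_errors: false  -> do not raise Ruby errors for JavaScript errors on the page
# timeout: 120      -> wait up to 120 seconds for PhantomJS to respond
options = { js_errors: false, timeout: 120 }
Capybara.register_driver :poltergeist do |app|
  Capybara::Poltergeist::Driver.new(app, options)
end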

Now we create a new Capybara session that uses the driver we just registered.

session = Capybara::Session.new(:poltergeist)

Visit the URL for Google Images.

url = "https://www.google.com/imghp?"
session.visit url

Fill in the search query input with our keywords.

session.fill_in('q', with: 'Spritle Software')

Submit the search query by clicking the search button.

session.click_button('Search')
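As an aside, the form steps could probably be skipped by visiting a results URL directly. The sketch below only builds such a URL. The `/search` path and the `tbm=isch` parameter are my assumptions about how Google’s image results URLs look, and `image_search_url` is a hypothetical helper name, so treat this as an untested alternative rather than part of the script above.

```ruby
require 'uri'

# Build a Google Images results URL for a search phrase.
# Assumption: image results live at /search with tbm=isch.
def image_search_url(query)
  params = URI.encode_www_form([['q', query], ['tbm', 'isch']])
  "https://www.google.com/search?#{params}"
end

puts image_search_url('Spritle Software')
# https://www.google.com/search?q=Spritle+Software&tbm=isch
```

With that in place, `session.visit image_search_url('Spritle Software')` would stand in for the visit, fill_in and click_button calls, provided the assumed URL format holds.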

Let’s select the first image on the results page and print out its URL.

image_url = session.first('img')['src']
puts image_url
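One thing to watch for: depending on how the results page is rendered, the `src` of a tile may be an inline `data:` URI rather than an ordinary URL. That is my assumption about Google’s markup, not something verified here. A minimal sketch for telling the two apart and decoding the inline case, using a hypothetical `decode_data_uri` helper:

```ruby
# Matches base64 data URIs such as "data:image/jpeg;base64,...."
DATA_URI = /\Adata:(?<mime>[\w.+\/-]+);base64,(?<payload>.+)\z/m

# Returns [mime_type, raw_bytes] for a base64 data URI,
# or nil for anything else (for example an http URL).
def decode_data_uri(src)
  match = DATA_URI.match(src.to_s)
  return nil unless match

  # unpack1('m') is Ruby's built-in base64 decoder
  [match[:mime], match[:payload].unpack1('m')]
end

sample = 'data:image/png;base64,aGVsbG8=' # placeholder payload, not a real image
mime, bytes = decode_data_uri(sample)
puts mime            # image/png
puts bytes.bytesize  # 5
```

If `decode_data_uri(image_url)` returns nil, the value is presumably a normal URL that can be downloaded separately. Otherwise the bytes can be written to disk with `File.binwrite`.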

Once we’ve done what we need, we have to quit the session:

session.driver.quit
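If any step fails before that line is reached, the PhantomJS process may be left running. One way to guard against that is to wrap the work in a helper that always quits the driver. This is a sketch: `with_session` is a hypothetical helper, and the fake objects are only there so the example runs without a browser.

```ruby
# Yield the session to the block and always quit its driver afterwards,
# even if the block raises.
def with_session(session)
  yield session
ensure
  session.driver.quit
end

# Stand-ins so the sketch runs on its own; in the real script you would
# pass Capybara::Session.new(:poltergeist) instead.
FakeDriver = Struct.new(:quit_called) do
  def quit
    self.quit_called = true
  end
end
FakeSession = Struct.new(:driver)

fake = FakeSession.new(FakeDriver.new(false))
begin
  with_session(fake) { |_session| raise 'scrape failed' }
rescue RuntimeError => e
  puts "rescued: #{e.message}"
end
puts fake.driver.quit_called # true
```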

This is a simple example showing how you can use Capybara in Ruby to scrape images from Google image search.
You can find the entire code listing here.

Conclusion

It’s very difficult to get the URL of the original image, so we fetch only the URL of the image tile on the results page, which is a smaller version. For our application that is sufficient. If anyone is able to get the exact URL of the original image, please leave a comment about it. If I work it out myself, I will update this post.
I hope this gave you a basic grasp of crawling and scraping.
 
