How to scrape data every day

corrado tuccitto asked in General

How can I schedule a job that gathers data with Nokogiri every morning and stores it (only what has changed) in a DB?

Reply

That's a really open question. Can you give us an example of what you would want to scrape?

Reply

I have a table with a fixed set of items, but the min price, med price, and max price values change every day.
For each item, I want to scrape the min, med, and max prices every day and store them in the DB.
The name of an item is always the same, but new items can be added and old ones removed, so it's important to check every day whether there are new items.

Here is an example of a website that you can scrape:

http://www.caat.it/it/listino/2015-08-28

Many thanks in advance.

Reply

This is neither efficient nor beautiful, but it'll at least get you started. You can use Nokogiri as well, but I just used regex as a quick and (emphasis on) dirty solution. Plus I haven't had any caffeine yet today. Here:

require 'cgi'
require 'mechanize'

page = Mechanize.new.get "http://www.caat.it/it/listino/2015-08-28"

# Each table row becomes an array of [column title, cell text] pairs.
products = page.body.scan(/<tr class="(odd|even).+?">(.+?)<\/tr>/m).map do |row|
  row.last.to_s.scan(/<td data-title="(.+?)" class=".+?" >(.+?)<\/td>/m).map do |key, val|
    # Unescape entities, then replace tags, newlines, and tabs with spaces.
    [key, CGI.unescapeHTML(val).gsub(/(<[^>]*>)|\n|\t/, " ").strip]
  end
end

# products.count == 160

# products.first.each{|key,val| puts "#{key} => #{val}"}
# P. Min => 1,10
# P. Pre => 1,15
# P. Max => 1,20
# Specie => ARANCE
# Varietà => VALENCIA LATE
# Calibro => 70-80 (6)
# Cat. => I
# Presentazione => A PIU' STRATI
# Marchio => &nbsp;
# Origine => SUD AFRICA
# Confezione => &nbsp;
# Unita misura => &nbsp;
# Altre => &nbsp;
# Gruppo => AGRUMI

That should get you started with getting the data. What you do with it after that is up to you. :)
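
For the "only what has changed" part of the original question, one possible approach is to key each row by the columns that identify an item and compare today's prices with the ones already stored. Here is a minimal sketch in plain Ruby, assuming rows shaped like the `products` output above. The choice of `KEY_COLUMNS` and the helper names `normalize` and `diff_listings` are hypothetical, not something taken from the site or from this thread:

```ruby
# Assumption: these columns together identify one item on the price list.
KEY_COLUMNS = ["Specie", "Varietà", "Calibro", "Cat.", "Origine"].freeze
PRICE_COLUMNS = ["P. Min", "P. Pre", "P. Max"].freeze

# Turn one scraped row (an array of [column, value] pairs)
# into [identity, prices].
def normalize(row)
  fields = row.to_h
  identity = fields.values_at(*KEY_COLUMNS)
  # Prices use a decimal comma ("1,10"); convert to exact Rationals.
  prices = PRICE_COLUMNS.to_h { |col| [col, fields.fetch(col).tr(",", ".").to_r] }
  [identity, prices]
end

# Compare the previously stored listing with today's scrape.
# Both arguments are hashes of identity => prices.
def diff_listings(previous, current)
  {
    added:   current.keys - previous.keys,
    removed: previous.keys - current.keys,
    changed: current.keys.select { |id| previous.key?(id) && previous[id] != current[id] }
  }
end
```

With `current = products.map { |row| normalize(row) }.to_h` and `previous` loaded from the DB in the same shape, only the `added` and `changed` entries would need a write; `removed` lists the items that dropped off the table.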

Reply

First of all thanks for your reply.
I have already developed a solution, but I'd like to see your implementation using threads, in a video.
Thanks.

Reply

Oh, I don't run GoRails. I'm just another subscriber like you.

I'm sure he'll see this, though. :)

Reply

Python has a much better ecosystem for scraping data:
http://scrapy.org/
But Mechanize will work, as will Wombat: https://github.com/felipecsl/wombat

Reply

You guys are awesome. :)

Mechanize is good. I've used that before. I've never checked out Scrapy or Wombat, but I'm sure they are awesome too.

Using the whenever gem to schedule this script every morning will make that part easy.
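
As a rough sketch of that scheduling step, a whenever `config/schedule.rb` could look something like this. The rake task name `listino:scrape` and the 6:00 am run time are assumptions made up for illustration; neither comes from this thread:

```ruby
# config/schedule.rb (read by the whenever gem)
every 1.day, at: "6:00 am" do
  # Hypothetical rake task that wraps the scraping script.
  rake "listino:scrape"
end
```

Running `whenever --update-crontab` afterwards should write the matching entry to your crontab.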

I'll see about adding this to the list of screencasts to cover!

Reply