How to scrape data every day

corrado tuccitto asked in General

How can I schedule a job that gathers data with Nokogiri every morning and stores it (only what has changed) in a database?

That's a really open question. Can you give us an example of what you would want to scrape?

I have a table with a fixed set of items, but values such as min price, med price, and max price change every day.
For each item, I want to scrape the min, med, and max every day and store them in a database.
The name of an item is always the same, but new items can be added and old ones removed, so it's important to check for new items every day.

Here is an example of a website that you can scrape

Many thanks in advance.

This is neither efficient nor beautiful, but it'll at least get you started. You can use Nokogiri as well, but I just used a regex as a quick and (emphasis on) dirty solution. Plus I haven't had any caffeine yet today. Here:

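require 'mechanize'
require 'cgi'

# The URL is missing from the post; put the address of the price table here.
url = ""
page = Mechanize.new.get(url)

# One entry per table row; each entry is a list of [column title, cell text] pairs.
products = page.body.scan(/<tr class="(odd|even).+?">(.+?)<\/tr>/m).map do |row|
  row.last.to_s.scan(/<td data-title="(.+?)" class=".+?" >(.+?)<\/td>/m).map do |key, val|
    # Replace nested tags, newlines, and tabs with spaces, then trim.
    # (No /s flag here: in Ruby it selects a regex encoding, not dot-all mode.)
    [key, CGI.unescapeHTML(val).gsub(/(<[^>]*>)|\n|\t/) { " " }.strip]
  end
end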

# products.count == 160

# products.first.each{|key,val| puts "#{key} => #{val}"}
# P. Min => 1,10
# P. Pre => 1,15
# P. Max => 1,20
# Specie => ARANCE
# Varietà => VALENCIA LATE
# Calibro => 70-80 (6)
# Cat. => I
# Presentazione => A PIU' STRATI
# Marchio => &nbsp;
# Origine => SUD AFRICA
# Confezione => &nbsp;
# Unita misura => &nbsp;
# Altre => &nbsp;
# Gruppo => AGRUMI

That should get you started with getting the data. What you do with it after that is up to you. :)
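For the "only what has changed" part of the question, one possible next step is to key each product by its identifying columns and compare today's prices with the previous run. This is only a rough sketch under assumptions: the helper names `snapshot` and `diff_snapshots`, the choice of key columns, and the second sample item are made up for illustration; the first sample row reuses the output shown above.

```ruby
# Columns assumed to identify an item; adjust to whatever is stable on the real site.
KEY_COLUMNS   = ["Specie", "Varietà", "Calibro", "Cat.", "Origine"].freeze
PRICE_COLUMNS = ["P. Min", "P. Pre", "P. Max"].freeze

# Build {item key => {price column => Float}} from the scraped [title, text] pairs.
def snapshot(products)
  products.to_h do |pairs|
    row = pairs.to_h
    key = KEY_COLUMNS.map { |col| row[col] }.join("|")
    # The site uses a decimal comma ("1,10"), so swap it before converting.
    prices = PRICE_COLUMNS.to_h { |col| [col, row[col].to_s.tr(",", ".").to_f] }
    [key, prices]
  end
end

# Compare two snapshots and list the keys that would need a database write.
def diff_snapshots(previous, current)
  {
    added:   current.keys - previous.keys,
    removed: previous.keys - current.keys,
    changed: (current.keys & previous.keys).reject { |key| previous[key] == current[key] }
  }
end

orange = [["Specie", "ARANCE"], ["Varietà", "VALENCIA LATE"], ["Calibro", "70-80 (6)"],
          ["Cat.", "I"], ["Origine", "SUD AFRICA"]]
# Made-up second item, only here to show how a new row is detected.
lemon  = [["Specie", "LIMONI"], ["Varietà", "EXAMPLE"], ["Calibro", "3"],
          ["Cat.", "I"], ["Origine", "ITALIA"]]

yesterday = snapshot([orange + [["P. Min", "1,10"], ["P. Pre", "1,15"], ["P. Max", "1,20"]]])
today     = snapshot([orange + [["P. Min", "1,10"], ["P. Pre", "1,15"], ["P. Max", "1,25"]],
                      lemon  + [["P. Min", "0,90"], ["P. Pre", "1,00"], ["P. Max", "1,10"]]])

result = diff_snapshots(yesterday, today)
puts result.inspect
```

Only the keys under `added` and `changed` would need an insert or update, and `removed` tells you which items disappeared since the last run. Where the previous snapshot comes from (a database table, a file) is left open here.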

First of all thanks for your reply.
I have already developed a solution, but I'd like to see your implementation using threads, as a video.

Oh, I don't run GoRails. I'm just another subscriber like you.

I'm sure he'll see this, though. :)

Python has a much better ecosystem for scraping data,
but Mechanize will work as well as

You guys are awesome. :)

Mechanize is good. I've used that before. I've never checked out Scrapy or Wombat, but I'm sure they are awesome too.

Using the whenever gem to schedule this script every morning will make that part easy.
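As a rough idea of what that could look like, here is a minimal `config/schedule.rb` for whenever. The time of day and the rake task name `prices:scrape` are placeholders I picked for the example, not something from this thread; the task itself would wrap the scraping script.

```ruby
# config/schedule.rb (read by the whenever gem)
# Runs a rake task once a day in the morning.
# "prices:scrape" is a hypothetical task name; replace it with your own.
every 1.day, at: '6:00 am' do
  rake "prices:scrape"
end
```

Running `whenever --update-crontab` then writes the matching entry to the crontab.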

I'll see about adding this to the list of screencasts to cover!

