How to scrape data every day

corrado tuccitto August 28, 2015 7:16am

How can I schedule a job that gathers data with Nokogiri every morning and stores it (only what has changed) in a db?

Robert Adler August 28, 2015 12:25pm

That's a really open question. Can you give us an example of what you would want to scrape?

Robert Adler August 28, 2015 12:25pm

Err double post. Disregard.

corrado tuccitto August 28, 2015 12:34pm

I have a table with a fixed set of items, but values such as min price, med price, and max price change every day.
For each item I want to scrape min, med, and max every day and store them in a db.
The name of an item is always the same, but items can be added or removed, so it's important to check every day whether there are new items.

Here is an example of a website that you can scrape:

http://www.caat.it/it/listino/2015-08-28

Many thanks in advance.

Robert Adler August 28, 2015 1:02pm

This is neither efficient nor beautiful, but it'll at least get you started. You can use Nokogiri as well, but I just used regex as a quick and (emphasis on) dirty solution. Plus I haven't had any caffeine yet today. Here:

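require 'cgi'        # for CGI.unescapeHTML
require 'mechanize'

page = Mechanize.new.get "http://www.caat.it/it/listino/2015-08-28"

# Each <tr class="odd|even ..."> is one product; each <td data-title="..."> is one field.
products = page.body.scan(/<tr class="(odd|even).+?">(.+?)<\/tr>/m).map do |row|
  row.last.to_s.scan(/<td data-title="(.+?)" class=".+?" >(.+?)<\/td>/m).map do |key, val|
    # Unescape entities, then replace tags, newlines and tabs with spaces.
    [key, CGI.unescapeHTML(val).gsub(/(<[^>]*>)|\n|\t/, " ").strip]
  end
end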

# products.count == 160

# products.first.each{|key,val| puts "#{key} => #{val}"}
# P. Min => 1,10
# P. Pre => 1,15
# P. Max => 1,20
# Specie => ARANCE
# Varietà => VALENCIA LATE
# Calibro => 70-80 (6)
# Cat. => I
# Presentazione => A PIU' STRATI
# Marchio => &nbsp;
# Origine => SUD AFRICA
# Confezione => &nbsp;
# Unita misura => &nbsp;
# Altre => &nbsp;
# Gruppo => AGRUMI

That should get you started with getting the data. What you do with it after that is up to you. :)
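
For the "only what has changed" part, here is a rough sketch of one way to compare today's rows with yesterday's. It is only an illustration under a few assumptions: the fields in KEY_FIELDS are my guess at what identifies an item, prices always look like "1,10", and the helper names (price_to_cents, snapshot, diff_snapshots) are made up for this example. It keeps everything in memory; loading yesterday's snapshot from your db and writing the added/changed rows back is left out, and running the script every morning can be left to cron or a gem such as whenever.

```ruby
# Assumption: these fields together identify an item from one day to the next.
KEY_FIELDS   = ['Specie', 'Varietà', 'Calibro', 'Cat.', 'Origine'].freeze
PRICE_FIELDS = ['P. Min', 'P. Pre', 'P. Max'].freeze

# "1,10" -> 110 (integer cents). Anything that is not digits, a comma and
# two decimals (for example "&nbsp;") gives nil.
def price_to_cents(text)
  match = text.to_s.strip.match(/\A(\d+),(\d{2})\z/)
  match && match[1].to_i * 100 + match[2].to_i
end

# rows: an array like `products` from the scraper, i.e. one array of
# [field name, value] pairs per table row.
# Returns a hash of identity => [min, pre, max] in cents.
def snapshot(rows)
  rows.map do |row|
    fields = row.to_h
    identity = KEY_FIELDS.map { |name| fields[name].to_s.strip }
    prices   = PRICE_FIELDS.map { |name| price_to_cents(fields[name]) }
    [identity, prices]
  end.to_h
end

# Compare two snapshots and list the identities that were added, removed,
# or whose prices differ.
def diff_snapshots(previous, current)
  {
    added:   current.keys - previous.keys,
    removed: previous.keys - current.keys,
    changed: current.keys.select { |id| previous.key?(id) && previous[id] != current[id] }
  }
end

# The first row is taken from the sample output; the "today" values and the
# LIMONI row are invented for the demo.
arance = [['Specie', 'ARANCE'], ['Varietà', 'VALENCIA LATE'], ['Calibro', '70-80 (6)'],
          ['Cat.', 'I'], ['Origine', 'SUD AFRICA']]
yesterday = snapshot([arance + [['P. Min', '1,10'], ['P. Pre', '1,15'], ['P. Max', '1,20']]])
today     = snapshot([arance + [['P. Min', '1,10'], ['P. Pre', '1,15'], ['P. Max', '1,25']],
                      [['Specie', 'LIMONI'], ['P. Min', '0,90'], ['P. Pre', '1,00'], ['P. Max', '1,10']]])

diff = diff_snapshots(yesterday, today)
puts "added: #{diff[:added].size}, removed: #{diff[:removed].size}, changed: #{diff[:changed].size}"
```

Storing prices as integer cents avoids float rounding when comparing day to day. Only the identities under added and changed would need an insert or update in the db; the ones under removed could be flagged or deleted, depending on whether you want to keep history.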

corrado tuccitto August 28, 2015 1:06pm