How to scrape data every day
How can I schedule a job to gather data with Nokogiri every morning and store it (only what has changed) in a database?
I have a table of fixed items, but variables such as the min price, med price, and max price change every day.
For each item I want to scrape min, med, and max every day and store them in the DB.
The name of each item is always the same, but items can be added or removed, so it's important to check every day whether there are new items.
Here is an example of a website that you can scrape:
http://www.caat.it/it/listino/2015-08-28
Many thanks in advance.
This isn't efficient, nor is it beautiful, but it'll at least get you started. You can use Nokogiri as well, but I just used regex as a quick and (emphasis on) dirty solution. Plus I haven't had any caffeine yet today. Here:
require 'mechanize'
require 'cgi'

page = Mechanize.new.get "http://www.caat.it/it/listino/2015-08-28"

# Each product is a <tr class="odd"> / <tr class="even"> row; every cell
# carries its column name in a data-title attribute.
products = page.body.scan(%r{<tr class="(odd|even).+?">(.+?)</tr>}m).map do |_, row|
  row.scan(%r{<td data-title="(.+?)" class=".+?" >(.+?)</td>}m)
     .map { |key, val| [key, CGI.unescapeHTML(val).gsub(/(<[^>]*>)|\n|\t/, " ").strip] }
end
# products.count == 160
# products.first.each{|key,val| puts "#{key} => #{val}"}
# P. Min => 1,10
# P. Pre => 1,15
# P. Max => 1,20
# Specie => ARANCE
# Varietà => VALENCIA LATE
# Calibro => 70-80 (6)
# Cat. => I
# Presentazione => A PIU' STRATI
# Marchio =>
# Origine => SUD AFRICA
# Confezione =>
# Unita misura =>
# Altre =>
# Gruppo => AGRUMI
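Since you asked about Nokogiri specifically, here's a rough equivalent using CSS selectors instead of regex. It's only a sketch and assumes the markup stays as above (tr.odd / tr.even rows whose cells carry a data-title attribute):

require 'mechanize'
require 'nokogiri'

page = Mechanize.new.get "http://www.caat.it/it/listino/2015-08-28"
doc  = Nokogiri::HTML(page.body)

# Build one hash per row, keyed by each cell's data-title attribute.
products = doc.css("tr.odd, tr.even").map do |row|
  row.css("td[data-title]").each_with_object({}) do |cell, item|
    item[cell["data-title"]] = cell.text.gsub(/\s+/, " ").strip
  end
end

# products.first["Specie"] # => "ARANCE"
# products.first["P. Min"] # => "1,10"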
That should get you started with getting the data. What you do with it after that is up to you. :)
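For the "store only what has changed" and "every morning" parts, here's a minimal sketch assuming a Rails-style setup with two made-up models, Item (the fixed products) and Price (one row per item per day). It uses the hash-per-row products from the Nokogiri sketch above (call .to_h on each entry if you use the regex version), and the model names and columns are just placeholders for whatever your schema looks like:

# Hypothetical models: Item(specie, varieta, calibro) with has_many :prices,
# and Price(item_id, date, min, med, max). Adapt to your actual schema.
products.each do |attrs|
  # find_or_create_by covers the "new items can appear" case.
  item = Item.find_or_create_by(
    specie:  attrs["Specie"],
    varieta: attrs["Varietà"],
    calibro: attrs["Calibro"]
  )

  price = item.prices.find_or_initialize_by(date: Date.today)
  price.assign_attributes(
    min: attrs["P. Min"],
    med: attrs["P. Pre"],
    max: attrs["P. Max"]
  )
  price.save if price.changed? # only write when something actually changed
end

To run it every morning, you could wrap the scraper in a rake task and schedule it with cron, for example via the whenever gem (the scraper:daily task name here is made up):

# config/schedule.rb (whenever gem)
every 1.day, at: "6:00 am" do
  rake "scraper:daily"
end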
First of all, thanks for your reply.
I have already developed a solution, but I'd like to see your implementation (maybe using threads) in a video.
Thanks
Oh, I don't run GoRails. I'm just another subscriber like you.
I'm sure he'll see this, though. :)
Python has a much better ecosystem for scraping data:
http://scrapy.org/
but Mechanize will work, as will https://github.com/felipecsl/wombat