How to scrape data every day

General • Asked by corrado tuccitto

How can I schedule a job that gathers data with Nokogiri every morning and stores it (only what has changed) in a database?


That's a really open question. Can you give us an example of what you would want to scrape?


I have a table with a fixed set of items, but the min price, med price, and max price of each item change every day.
For each item I want to scrape the min, med, and max every day and store them in the db.
The name of an item is always the same, but items can be added or removed, so it's important to check every day whether there are new items.

Here is an example of a website that you can scrape:

http://www.caat.it/it/listino/2015-08-28

Many thanks in advance.


This is neither efficient nor beautiful, but it'll at least get you started. You can use Nokogiri as well, but I just used regex as a quick and (emphasis on) dirty solution. Plus I haven't had any caffeine yet today. Here:

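require 'cgi'        # for CGI.unescapeHTML
require 'mechanize'

page = Mechanize.new.get "http://www.caat.it/it/listino/2015-08-28"

# Each <tr class="odd|even ..."> is one product; each <td data-title="..."> becomes a [key, value] pair.
products = page.body.scan(/<tr class="(odd|even).+?">(.+?)<\/tr>/m).map do |_row_class, row|
  row.scan(/<td data-title="(.+?)" class=".+?" >(.+?)<\/td>/m).map do |key, val|
    # Replace leftover tags, newlines, and tabs with spaces, then trim.
    [key, CGI.unescapeHTML(val).gsub(/<[^>]*>|\n|\t/, " ").strip]
  end
end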

# products.count == 160

# products.first.each{|key,val| puts "#{key} => #{val}"}
# P. Min => 1,10
# P. Pre => 1,15
# P. Max => 1,20
# Specie => ARANCE
# Varietà => VALENCIA LATE
# Calibro => 70-80 (6)
# Cat. => I
# Presentazione => A PIU' STRATI
# Marchio => &nbsp;
# Origine => SUD AFRICA
# Confezione => &nbsp;
# Unita misura => &nbsp;
# Altre => &nbsp;
# Gruppo => AGRUMI

That should get you started with getting the data. What you do with it after that is up to you. :)
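
For the "only what has changed" part of the question, one option is to compare today's rows with the previous day's snapshot before touching the database. Here is a minimal sketch, assuming each row is an array of [key, value] pairs like the entries of `products` above, that every row has numeric prices, and that the columns in `ID_KEYS` together identify an item (that choice is a guess, not something the site documents). The names `ID_KEYS`, `PRICE_KEYS`, `parse_price`, `normalize`, and `diff_snapshot` are made up for illustration.

```ruby
require 'bigdecimal'

# Assumption: these columns together identify an item from one day to the next.
ID_KEYS    = ["Specie", "Varietà", "Calibro", "Origine"].freeze
# The three price columns that change daily.
PRICE_KEYS = ["P. Min", "P. Pre", "P. Max"].freeze

# Convert an Italian-formatted price such as "1,10" into a BigDecimal.
def parse_price(text)
  BigDecimal(text.strip.tr(",", "."))
end

# Turn one scraped row (an array of [key, value] pairs) into [identity, prices].
def normalize(row)
  fields   = row.to_h
  identity = ID_KEYS.map { |key| fields.fetch(key) }
  prices   = PRICE_KEYS.to_h { |key| [key, parse_price(fields.fetch(key))] }
  [identity, prices]
end

# Compare the previous snapshot ({ identity => prices }) with today's rows.
# Returns the identities to insert, update, and delete, plus the new snapshot.
def diff_snapshot(previous, rows)
  current = rows.map { |row| normalize(row) }.to_h
  {
    added:    current.keys - previous.keys,
    removed:  previous.keys - current.keys,
    changed:  current.keys.select { |id| previous.key?(id) && previous[id] != current[id] },
    snapshot: current
  }
end
```

Calling `diff_snapshot(yesterday_snapshot, products)` each morning would then tell you which items to insert, update, or delete; unchanged items need no write at all. Where the snapshot lives (a database table, a file) is up to you.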


First of all, thanks for your reply.
I have already developed a solution, but I'd like to see your implementation using threads, in a video.
Thanks.


Oh, I don't run GoRails. I'm just another subscriber like you.

I'm sure he'll see this, though. :)


Python has a much better ecosystem for scraping data:
http://scrapy.org/
But Mechanize will work, as will Wombat: https://github.com/felipecsl/wombat


You guys are awesome. :)

Mechanize is good. I've used that before. I've never checked out Scrapy or Wombat, but I'm sure they are awesome too.

Using the whenever gem to schedule this script every morning will make that part easy.
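
As a rough idea of what that could look like, here is a sketch of a whenever schedule file. The script path and the 6:00 am run time are placeholders I made up, so adjust both to your setup.

```ruby
# config/schedule.rb (read by the whenever gem)

# Assumption: the scraper is a standalone script at /path/to/scrape.rb.
every 1.day, at: '6:00 am' do
  command "ruby /path/to/scrape.rb"
end
```

Running `whenever --update-crontab` then writes the matching entry to your crontab. If the scraper lives inside a Rails app, whenever's `runner` or `rake` job types may fit better than `command`.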

I'll see about adding this to the list of screencasts to cover!

