How to scrape data every day

corrado tuccitto asked in General

How can I schedule gathering data with Nokogiri and storing it (only what has changed) in a database every morning?

That's a really open question. Can you give us an example of what you would want to scrape?

Err double post. Disregard.

I have a table with fixed items, but variable fields: min price, median price, and max price change every day.
For each item I want to scrape min/med/max daily and store them in a DB.
The item names stay the same, but items can be added or removed over time, so it's important to check every day whether there are new items.

Here is an example of a website that you can scrape:

http://www.caat.it/it/listino/2015-08-28

many thanks in advance

This isn't efficient, nor is it beautiful, but it'll at least get you started. You can use Nokogiri as well, but I just used a regex as a quick and (emphasis on) dirty solution. Plus I haven't had any caffeine yet today. Here:

require 'mechanize'
require 'cgi' # CGI.unescapeHTML is used below and isn't loaded by mechanize itself

page = Mechanize.new.get "http://www.caat.it/it/listino/2015-08-28"

# Each <tr class="odd"/"even"> is one product; each <td data-title="...">
# holds one field. Strip tags, decode entities, collapse whitespace.
products = page.body.scan(/<tr class="(odd|even).+?">(.+?)<\/tr>/m).map do |row|
  row.last.scan(/<td data-title="(.+?)" class=".+?" >(.+?)<\/td>/m).map do |key, val|
    [key, CGI.unescapeHTML(val).gsub(/<[^>]*>|[\n\t]/, " ").strip]
  end
end

# products.count == 160

# products.first.each{|key,val| puts "#{key} => #{val}"}
# P. Min => 1,10
# P. Pre => 1,15
# P. Max => 1,20
# Specie => ARANCE
# Varietà => VALENCIA LATE
# Calibro => 70-80 (6)
# Cat. => I
# Presentazione => A PIU' STRATI
# Marchio => &nbsp;
# Origine => SUD AFRICA
# Confezione => &nbsp;
# Unita misura => &nbsp;
# Altre => &nbsp;
# Gruppo => AGRUMI

That should get you started with getting the data. What you do with it after that is up to you. :)
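For the "store only what has changed" part of the original question, one possible sketch (plain Ruby, no particular database assumed) is to fingerprint each item's prices and compare against yesterday's fingerprints, writing only rows that are new or different. The key and column names below just mirror the sample output above:

```ruby
require 'digest'

# Sketch: pick out the scraped products that need to be (re)written.
# `todays_products` is an array of [key, value] pair lists, like the
# `products` variable above; `previous_fingerprints` maps an item key
# ("Specie|Varietà|Calibro") to the SHA-256 of yesterday's prices.
def changed_products(todays_products, previous_fingerprints)
  todays_products.select do |product|
    fields = product.to_h
    key = fields.values_at("Specie", "Varietà", "Calibro").join("|")
    fingerprint = Digest::SHA256.hexdigest(
      fields.values_at("P. Min", "P. Pre", "P. Max").join("|")
    )
    previous_fingerprints[key] != fingerprint
  end
end
```

A brand-new item has no previous fingerprint, so it's selected automatically, which covers the "check for new items" requirement too.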

First of all, thanks for your reply.
I have already developed a solution, but I'd like to see your implementation walked through in a video.
Thanks

Oh, I don't run GoRails. I'm just another subscriber like you.

I'm sure he'll see this, though. :)

Python has a much better ecosystem for scraping data:
http://scrapy.org/
But Mechanize will work, as will https://github.com/felipecsl/wombat

You guys are awesome. :)

Mechanize is good. I've used that before. I've never checked out Scrapy or Wombat, but I'm sure they are awesome too.

Using whenever to schedule this script every morning will make that part easy.
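A minimal `config/schedule.rb` sketch for whenever could look like this (the `scrape:daily` task name is a placeholder for whatever rake task wraps the scraping script):

```ruby
# config/schedule.rb -- whenever's DSL; running `whenever --update-crontab`
# turns this into a cron entry. `scrape:daily` is a hypothetical rake task.
every 1.day, at: "6:00 am" do
  rake "scrape:daily"
end
```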

I'll see about adding this to the list of screencasts to cover!


    © 2020 GoRails, LLC. All rights reserved.