Import CSV data using RubyZip and PostgreSQL COPY

Jay Killeen asked in Gems / Libraries

I've checked out the videos on importing data into your Rails app. Very timely, as I am deep into this feature, bringing data from all our legacy systems into my Rails app.

My issue is that I am importing millions of rows of data. I have used the PostgreSQL COPY command to speed this up, but it is still taking around 15-20 minutes to upload the 150MB files to the server (slow connection on the work VPN).

I have raised my question on Stack Overflow if you'd like to take a look:
http://stackoverflow.com/questions/29225140/using-rubyzip-in-rails-to-import-file-data-to-postgresql-with-copy

The gist of it is refactoring...

THIS

    def process file=nil
      file ||= @file.tempfile
      ActiveRecord::Base.connection.execute('truncate customers')
      conn = ActiveRecord::Base.connection_pool.checkout
      raw  = conn.raw_connection
      raw.exec("COPY customers FROM STDIN WITH (FORMAT CSV, DELIMITER ',',  NULL '')")
      # open up your CSV file looping through line by line and getting the line into a format suitable for pg's COPY...
      data = file.open
      data.gets # read and discard the CSV header row

      ticker = 0
      counter = 0
      success_counter = 0
      failed_records = []

      data.each_with_index do |line, index|
        raw.put_copy_data line
        counter += 1
      end
      # once all done...
      raw.put_copy_end
      while res = raw.get_result do; end # very important to do this after a copy
      ActiveRecord::Base.connection_pool.checkin(conn)
      return { :item_count => counter, :processed_successfully => counter, :errored_records => failed_records }
    end

INTO

    def process file=nil
      file ||= @file.tempfile
      Zip::File.open(file) do |zip_file|
        zip_file.each do |entry|
          content = entry.get_input_stream
          data = content.read
          ActiveRecord::Base.connection.execute('truncate customers')
          conn = ActiveRecord::Base.connection_pool.checkout
          raw  = conn.raw_connection
          raw.exec("COPY customers FROM STDIN WITH (FORMAT CSV, DELIMITER ',',  NULL '')")
          # open up your CSV file looping through line by line and getting the line into a format suitable for pg's COPY...

          ticker = 0
          counter = 0
          success_counter = 0
          failed_records = []

          data.lines.map(&:chomp).each_with_index do |line, index|
            raw.put_copy_data line
            counter += 1
          end
          # once all done...
          raw.put_copy_end
          while res = raw.get_result do; end # very important to do this after a copy
          ActiveRecord::Base.connection_pool.checkin(conn)
          return { :item_count => counter, :processed_successfully => counter, :errored_records => failed_records }
        end
      end
    end

But for whatever reason, when I run it, no data gets saved to the database.

Also, to add to the videos that Chris has made: I chose to create an imports controller with an import module and a class for each model I am importing. It is working really well. It will be super duper if I can figure out how to import it from a zip file :)
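
(For anyone following along, here is a rough sketch of what that controller / per-model class layout might look like; every name here is illustrative, not the actual code from this app.)

    # app/controllers/imports_controller.rb (illustrative)
    class ImportsController < ApplicationController
      def create
        result = Imports::CustomerImport.new(params[:file]).process
        redirect_to imports_path, notice: "Imported #{result[:item_count]} customers"
      end
    end

    # app/models/imports/customer_import.rb (illustrative)
    module Imports
      class CustomerImport
        def initialize(file)
          @file = file # e.g. an uploaded file that responds to #tempfile
        end

        def process(file = nil)
          file ||= @file.tempfile
          # ... the COPY logic from the snippets above ...
        end
      end
    end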

Reply

I love the idea of importing from a zip file.

I'm wondering where that might have gone wrong. It seems like it should have worked, but maybe entry.get_input_stream is not providing you the same value as it does when you're doing it without the zip?

My suggestion would be to try inspecting the value of data in both cases and see if they are different in any way. That could be the issue there.
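
(One quick way to do that, as a sketch: pull the first line out of each source in a console or pry session and compare the inspect output, which makes invisible differences such as a trailing newline or encoding artifacts show up. csv_path and zip_path are placeholders here.)

    # Console sketch -- csv_path and zip_path are placeholders.
    require 'zip'

    csv_line = File.open(csv_path, &:gets)
    zip_line = Zip::File.open(zip_path) { |zip| zip.entries.first.get_input_stream.gets }

    puts csv_line.inspect # String#inspect shows a missing "\n" or stray characters
    puts zip_line.inspect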

Reply

Yeah, it is odd. I dove in with pry and saw that the strings are identical. I thought maybe a second loop had run through and truncate customers had been hit again, so essentially everything is working but I am destroying it all once I am finished! I'll try commenting out the truncation, doing it from pgAdmin instead, and see if it works :)
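
(If a second zip entry re-running the truncate were the culprit, another way to rule it out, besides commenting it out, would be to move the truncate ahead of the entry loop; a rough sketch, assuming the same method as above:)

    # Truncate once, before walking the zip entries, so a later entry
    # can't wipe out rows loaded from an earlier one.
    ActiveRecord::Base.connection.execute('truncate customers')

    Zip::File.open(file) do |zip_file|
      zip_file.each do |entry|
        # ... COPY this entry's rows as in the snippet above ...
      end
    end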

Reply

Or ideally do this with RSpec, but god, I am finding that to be a steep learning curve!

Reply

Yeah, I'm curious what might be causing it to not crash but fail silently. Seems like you're close, but something is just slightly wrong.

The trouble with RSpec here would be that it'll tell you it didn't work, but not why, which is the real question.
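
(For what it's worth, a minimal spec here would really only pin down the end state, which is exactly the "it didn't work, but not why" problem; a sketch, with the class name and fixture path being assumptions:)

    # spec/models/imports/customer_import_spec.rb (illustrative)
    require 'rails_helper'

    RSpec.describe Imports::CustomerImport do
      it 'copies every CSV row from the zip into customers' do
        zip_path = Rails.root.join('spec', 'fixtures', 'customers.zip').to_s

        result = described_class.new(nil).process(zip_path)

        expect(result[:item_count]).to be > 0
        expect(Customer.count).to eq(result[:item_count])
      end
    end

A failure here would only report that the counts differ, not which line choked COPY, which is the point above.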

Reply

This is what the data looks like when I am importing from CSV, at the point the command raw.put_copy_data line is run:

"415906,\"BISHOP, JOEL\",9 GARCIA COURT,PEREGIAN BEACH,4573,QLD,EXTERNAL,D,230,Country,Blocked,JOEL,N,A17,407,004,464,H01,400659,001,C16,2014-08-28 00:00:00,neym,2014-03-04 00:00:00,llavejoy,4\n"

This is the same line when running from the zip:

"415906,\"BISHOP, JOEL\",9 GARCIA COURT,PEREGIAN BEACH,4573,QLD,EXTERNAL,D,230,Country,Blocked,JOEL,N,A17,407,004,464,H01,400659,001,C16,2014-08-28 00:00:00,neym,2014-03-04 00:00:00,llavejoy,4"

Writing this comment helped me solve the issue!!

Reply

Two issues solved with this piece of code. It skips line one, which is the header, and appends the "\n" newline back onto the string:

          data.lines.map(&:chomp).each_with_index do |line, index|
            if index > 0              # index 0 is the header row, so skip it
              line << "\n"            # chomp stripped the newline; COPY needs it back
              raw.put_copy_data line
              counter += 1
            end
          end
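
(A small streaming variation, purely as a sketch: Zip::InputStream#gets keeps the trailing newline and reads the entry line by line, so the 150MB file never has to sit in memory as one big string. The surrounding COPY setup -- raw, counter and the truncate -- is assumed to be the same as in the method above.)

    require 'zip'

    Zip::File.open(file) do |zip_file|
      zip_file.each do |entry|
        io = entry.get_input_stream
        io.gets                      # discard the header row
        while (line = io.gets)       # gets keeps the trailing "\n" that COPY expects
          raw.put_copy_data line
          counter += 1
        end
      end
    end
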
Reply

Wow so it was as simple as the missing newline character?!

Reply