Import CSV data using RubyZip and PostgreSQL COPY
I've checked out the videos on importing data to your Rails app. Very timely, as I am deep into this feature, bringing data from all our legacy systems into my Rails app.
My issue is that I am importing millions of rows of data. I have used the PostgreSQL COPY command to speed this up, but it is still taking around 15-20 minutes to upload the 150MB files to the server (slow connection on the work VPN).
I have raised my question on SO if you can take a look:
http://stackoverflow.com/questions/29225140/using-rubyzip-in-rails-to-import-file-data-to-postgresql-with-copy
The gist of it is refactoring...
THIS
def process(file = nil)
  file ||= @file.tempfile
  ActiveRecord::Base.connection.execute('truncate customers')
  conn = ActiveRecord::Base.connection_pool.checkout
  raw = conn.raw_connection
  raw.exec("COPY customers FROM STDIN WITH (FORMAT CSV, DELIMITER ',', NULL '')")
  # open up your CSV file, looping through line by line and getting each line into a format suitable for pg's COPY...
  data = file.open
  data.gets # skip the header row
  ticker = 0
  counter = 0
  success_counter = 0
  failed_records = []
  data.each_with_index do |line, index|
    raw.put_copy_data line
    counter += 1
  end
  # once all done...
  raw.put_copy_end
  while res = raw.get_result do; end # very important to do this after a copy
  ActiveRecord::Base.connection_pool.checkin(conn)
  return { :item_count => counter, :processed_successfully => counter, :errored_records => failed_records }
end
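For reference, a minimal sketch (with illustrative stand-in data and names) of why this first version works: `gets` consumes the header row, and each line that `each` yields afterwards keeps its trailing newline, which COPY relies on as the row terminator on STDIN:

```ruby
require "tempfile"

# Build a small stand-in for the uploaded CSV (illustrative data only)
csv = Tempfile.new("customers")
csv.write("id,name\n415906,JOEL\n415907,JANE\n")
csv.rewind

csv.gets # consumes and discards the header row, just like data.gets above

# IO#each yields every remaining line *with* its trailing "\n",
# which is the row terminator COPY expects
rows = csv.each.to_a
rows.each { |line| puts line.inspect }
```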
INTO
def process(file = nil)
  file ||= @file.tempfile
  Zip::File.open(file) do |zip_file|
    zip_file.each do |entry|
      content = entry.get_input_stream
      data = content.read
      ActiveRecord::Base.connection.execute('truncate customers')
      conn = ActiveRecord::Base.connection_pool.checkout
      raw = conn.raw_connection
      raw.exec("COPY customers FROM STDIN WITH (FORMAT CSV, DELIMITER ',', NULL '')")
      # loop through the CSV data line by line, getting each line into a format suitable for pg's COPY...
      ticker = 0
      counter = 0
      success_counter = 0
      failed_records = []
      data.lines.map(&:chomp).each_with_index do |line, index|
        raw.put_copy_data line
        counter += 1
      end
      # once all done...
      raw.put_copy_end
      while res = raw.get_result do; end # very important to do this after a copy
      ActiveRecord::Base.connection_pool.checkin(conn)
      return { :item_count => counter, :processed_successfully => counter, :errored_records => failed_records }
    end
  end
end
But for whatever reason, when I run it, no data is saved to the database.
Also, to add to the videos that Chris has made: I chose to create an imports controller with an import module and a class for each model I am importing. It is working really well. It will be super duper if I can figure out how to import it from a Zip file :)
I love the idea of importing from a zip file.
I'm wondering where that might have gone wrong. It seems like it should have worked, but maybe the entry.get_input_stream is not providing you the same value as it does when you're doing it without the Zip?
My suggestion would be to try inspecting the value of data in both cases and see if they are different in any way. That could be the issue there.
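One quick way to do that inspection (the strings here are shortened, illustrative stand-ins for the real rows): String#inspect makes invisible characters like a trailing newline visible in the quoted form, so a difference that's easy to miss with plain puts jumps out:

```ruby
from_csv = "415906,JOEL\n" # illustrative stand-in for a line from the plain CSV path
from_zip = "415906,JOEL"   # illustrative stand-in for the same line via the zip path

puts from_csv.inspect # the trailing \n is visible in the quoted output
puts from_zip.inspect

puts from_csv == from_zip # false: the strings differ only by the newline
```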
Yeah, it is odd. I dived in with pry and saw that the strings are identical. I thought maybe a second loop had run through and truncate customers had been hit again, so essentially it was all working but I was destroying it all once I finished! I'll try commenting out the truncation, doing it from pgAdmin, and seeing if it works :)
Yeah, I'm curious what might be causing it to fail silently rather than crash. Seems like you're close but something is just slightly wrong.
The trouble with RSpec here would be that it'll tell you it didn't work, but not why, which is the real question.
This is what the data looks like when I am importing from CSV, at the point the command raw.put_copy_data line is run:
"415906,\"BISHOP, JOEL\",9 GARCIA COURT,PEREGIAN BEACH,4573,QLD,EXTERNAL,D,230,Country,Blocked,JOEL,N,A17,407,004,464,H01,400659,001,C16,2014-08-28 00:00:00,neym,2014-03-04 00:00:00,llavejoy,4\n"
This is the same line when running from the zip:
"415906,\"BISHOP, JOEL\",9 GARCIA COURT,PEREGIAN BEACH,4573,QLD,EXTERNAL,D,230,Country,Blocked,JOEL,N,A17,407,004,464,H01,400659,001,C16,2014-08-28 00:00:00,neym,2014-03-04 00:00:00,llavejoy,4"
Writing this comment helped me solve the issue!!
Two issues are solved with this piece of code: it skips line one, which is the header, and it appends the "\n" newline to each string.
data.lines.map(&:chomp).each_with_index do |line, index|
  if index > 0
    line << "\n"
    raw.put_copy_data line
    counter += 1
  end
end
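To make the fix concrete, here's a self-contained sketch (with illustrative stand-in data; the copied array stands in for the raw.put_copy_data calls) showing what lines, chomp, and the re-appended "\n" each do, plus the header skip:

```ruby
data = "id,name\n415906,JOEL\n415907,JANE\n" # stand-in for entry.get_input_stream.read

# String#lines keeps each trailing "\n"; map(&:chomp) strips it, which is
# what starved COPY of its row terminators in the original zip version
chomped = data.lines.map(&:chomp)

copied  = [] # stand-in for the raw.put_copy_data calls
counter = 0

chomped.each_with_index do |line, index|
  if index > 0   # index 0 is the header row, so skip it
    line << "\n" # restore the newline COPY uses as a row terminator
    copied << line
    counter += 1
  end
end

puts copied.inspect
puts counter
```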