Importing datasets. Recommendations?
I have a few datasets that I need to import into my rails app. They vary from 100 records to 5000 which will span across a few models with various associations.
Q.1: I can cleanse the data prior to importing... Any recommendations of data type: CSV, Excel, Tab delimitated, comma separated, etc?
Q.2: Any gems, or techniques that are common use or best practice or should I write a custom rake/chron task?
Comma separated CSV is usually pretty good standard to go with. Sometimes Excel saves with weird encodings though so it makes it less simple.
I think roo is pretty good for parsing files like these. I'd write a rake task to test it against a local file. The rake task can then just call a method a model so that it's in the correct place for use in your Rails app. Plus if you create a rake task, you can set it up to run nightly with a cron job if you ever had the need for that.
I also recommend CSV and try to make sure the CSV file is in UTF-8. as Excel sometimes encodes the files as UTF-16 or other weird things. Also saving an excel file as CSV doesn't always mean it is comma separated. Excel generally defaults to tab separated. To fix all these things I run bash scripts and use tools such as iconv
and awk
iconv -f utf-16 -t utf-8 sales.csv > sales_utf8.csv
will convert your utf-16 to utf-8.
This is the bash script I use to convert an Cognos tab separated files into straight csv (comma separated).
awk 'BEGIN{FS="\t";OFS=",";Q="\""}
{for (i=1;i<=NF;++i)
if ($i ~ /[",]/)
$i = Q gensub(/"/,Q Q,"g",$i) Q
}
{$1 = $1}
1' sales_utf8.csv > sales_utf8_awk.csv
Oooh Evil Cognos... lol.
I'm having to deal with data translation and import so the way I did it was to write a simple rake task using the CSV
class that imports my data into the appropriate models. I gleamed some of the code from Chris' CSV upload tutorial he posted on GoRails recently. If you need some help or example code, let me know. I love this sort of stuff.