Massive Daily MySQL Imports
I am working on a site that has to import a large amount of SQL data every night from a master feed provided by a third party. The import takes about 8 minutes (I hope to get that time down before deployment). The script has to truncate the tables before importing the fresh data, so there is a brief period during which most or all of the data is unavailable on the site. Do you think it is necessary to disable the site with a sort of under-construction page while this import is happening? I think a better option would be to somehow import the data without disturbing what is already there. What would be the best way to do that?
teammatt3 posted this at 16:30 — 3rd October 2008.
He has: 2,102 posts
Joined: Sep 2003
Have you thought about creating two databases? While one is being used on the production site, you could be importing into the other. Once the import is done, a script runs that swaps out your dbconfig file for one that points to the other DB. You would just alternate which database you import into each night (that could be automated).
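Roughly, the nightly job might look something like this. This is a sketch only; the database and table names are made up, and the actual swap of the dbconfig file happens outside SQL:
-- Sketch: feed_a is live tonight, so the import refills feed_b.
TRUNCATE TABLE feed_b.listings;
-- ... run the normal import statements against feed_b here ...
-- After the import succeeds, a script swaps the dbconfig file so the
-- site starts reading from feed_b; tomorrow the roles reverse.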
pr0gr4mm3r posted this at 16:51 — 3rd October 2008.
He has: 1,502 posts
Joined: Sep 2006
The only problem I see with that is that I would have to have a third database to store the data that doesn't change every day. I don't have any other tables right now, as the site is fairly static besides the import data, so it's just a potential problem for the future.
decibel.places posted this at 23:57 — 3rd October 2008.
He has: 1,494 posts
Joined: Jun 2008
Depends on the traffic, but 8+ minutes down can alienate users.
The 3 database solution sounds workable.
If it's a Drupal site, you can automate the site maintenance setting, but that would not be necessary with database switching.
I work on a ratings site that has a monthly reset of some tables in the database, but that is different; we don't import anything, just back up, truncate, and update.
pr0gr4mm3r posted this at 01:35 — 4th October 2008.
He has: 1,502 posts
Joined: Sep 2006
I really doubt many people will be browsing the site in the middle of the night, but it strikes me as a scalability issue, and it makes me paranoid.
Nah, I do most sites with CodeIgniter.
JeevesBond posted this at 14:17 — 4th October 2008.
He has: 3,956 posts
Joined: Jun 2002
Well, unless it's a regional site, the middle of the night for you is the middle of the day for someone else.
Are you using LOAD DATA LOCAL INFILE to load the data? If so, I think the best approach is the one Matt suggested, and there are ways around the problem of the static data.
You could use two databases: one is your live DB, the other you treat as a temporary data store. So pull down the data, do a LOAD DATA LOCAL INFILE into the temp DB; optionally, delete records in the temp DB that are already present in the live DB; then use an INSERT IGNORE ... SELECT ... statement to copy the data from the temp DB to the live DB. Using table locking and temporarily switching off indexes will speed up the DB-to-DB copy significantly.
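Something along these lines, as a sketch only; the database, table, and column names are invented, and the feed is assumed to arrive as CSV:
-- 1. Bulk-load the raw feed into the temporary database.
LOAD DATA LOCAL INFILE '/path/to/feed.csv'
    INTO TABLE temp_db.listings
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n';

-- 2. Optional: drop rows that already exist in the live table.
DELETE t FROM temp_db.listings AS t
    JOIN live_db.listings AS l ON l.listing_id = t.listing_id;

-- 3. Copy across with non-unique index maintenance off and the tables locked.
ALTER TABLE live_db.listings DISABLE KEYS;   -- affects MyISAM tables
LOCK TABLES live_db.listings WRITE, temp_db.listings READ;

INSERT IGNORE INTO live_db.listings
    SELECT * FROM temp_db.listings;          -- assumes identical column layout

UNLOCK TABLES;
ALTER TABLE live_db.listings ENABLE KEYS;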
pr0gr4mm3r posted this at 15:03 — 4th October 2008.
He has: 1,502 posts
Joined: Sep 2006
It's a real-estate site for a local area, so there wouldn't be many international visitors.
No... I wish I could, but the third party we get the data from has a habit of changing the format of the CSV files frequently, and without telling us. I parse them using PHP's fgetcsv() and then format everything for the database.
Shaggy posted this at 18:26 — 26th May 2009.
They have: 121 posts
Joined: Dec 2008
I'm (a lot) late to the party, but here are a couple more ideas:
1) Wrap everything in a transaction*:
START TRANSACTION;   -- BEGIN WORK is a synonym in MySQL
DELETE FROM tblname;
-- ... bunch of inserts ...
COMMIT;              -- if every statement succeeded
-- or, if anything failed:
-- ROLLBACK;
* Never tried this in MySQL, though it apparently supports transactions now (on InnoDB tables; MyISAM tables ignore them)...
2) Implement a sort of 'row versioning':
Add a table 'imports' with an import_id, a start timestamp, an end timestamp, and a 'status' field, as well as anything else you might like (number of rows inserted, etc.).
Add a field to your import target, 'import_id', and change your indexes/keys so they are unique (compound) on 'import_id' and listing_id.
Before beginning the import, insert a row into 'imports' to obtain a new unique 'import_id', with status = 'incomplete'.
Insert your data, with each row carrying the new 'import_id' you've obtained.
When the import is done and successful, update the 'imports' row with a 'success' status.
When successful, you can either delete the previous 'version' (import_id) of the rows, or leave them there for historical purposes.
In your application, you'll be selecting rows with your current criteria plus "and import_id = (select import_id from imports where status = 'success' and end_timestamp = (select max(end_timestamp) from imports where status = 'success'))".
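In MySQL terms that might look something like the following sketch; 'import_id', 'status', and listing_id come from the description above, while 'listings' and the exact timestamp column names are placeholders:
-- Sketch of the row-versioning scheme; some names are illustrative.
CREATE TABLE imports (
    import_id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    start_ts      DATETIME NOT NULL,
    end_ts        DATETIME NULL,
    status        VARCHAR(20) NOT NULL DEFAULT 'incomplete',
    rows_inserted INT UNSIGNED NULL
);

-- The import target carries the import_id; uniqueness is per import.
ALTER TABLE listings
    ADD COLUMN import_id INT UNSIGNED NOT NULL,
    ADD UNIQUE KEY uq_import_listing (import_id, listing_id);

-- The application then reads only the latest successful import:
SELECT *
  FROM listings
 WHERE import_id = (SELECT import_id
                      FROM imports
                     WHERE status = 'success'
                       AND end_ts = (SELECT MAX(end_ts)
                                       FROM imports
                                      WHERE status = 'success'));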
Cheers,
Shaggy
greg posted this at 20:52 — 26th May 2009.
He has: 1,581 posts
Joined: Nov 2005
Greg K posted this at 21:27 — 26th May 2009.
He has: 2,145 posts
Joined: Nov 2003
(yes the solution is probably already in place by now, but I like posting anyhow for those who may find this thread on a search engine looking for similar solutions)
Also, something I did to speed up the import considerably: instead of issuing individual SQL statements to insert the data, I use mysqlimport (http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html) to import a whole CSV file straight into the table.
Some tips:
If you're working on a local MySQL server (the CSV file is on the same machine), use LOAD DATA instead, so that if you get any warnings you can run SHOW WARNINGS straight afterwards.
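For example (a sketch with an invented file path and table name, assuming a CSV feed):
LOAD DATA INFILE '/var/feeds/listings.csv'
    INTO TABLE listings
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n';

SHOW WARNINGS;   -- lists truncated values, skipped rows, etc. from the load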
I needed to delete certain records as well, and only insert most of the records from the feed I received. I had a script go through the CSV file and, for any record that was going to need deleting, set a flag on that record. Then, for anything that needed inserting, I'm picky and do a data integrity check to make sure I didn't get any bad rows (i.e. that the field count matches for each row). I also get dates in MM/DD/YYYY format, and this is a good time to swap those over to YYYY-MM-DD. Save all the good data into a new CSV file.
As this checking and processing is what takes the majority of the time, the data in the database stays intact until you are ready to finally do the update.
Now you can do a single DELETE FROM tblWhatever WHERE DelFlag='Y'
Then immediately do the mysqlimport (you will need to run this via a system() command).
I have found on two projects that this is much faster than doing individual inserts/updates. Another method, instead of writing an extra CSV file, would be to insert all the cleansed data into a temporary table and then issue one statement to insert all records from the temp table into the live one.
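A sketch of that staging-table variant; tblWhatever and DelFlag come from above, while the staging table name is made up:
CREATE TABLE tblWhatever_staging LIKE tblWhatever;

-- ... the cleansing script inserts only the good rows into tblWhatever_staging ...

DELETE FROM tblWhatever WHERE DelFlag = 'Y';           -- remove the flagged records
INSERT INTO tblWhatever SELECT * FROM tblWhatever_staging;
DROP TABLE tblWhatever_staging;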
I stick with the CSV file here, though, for other purposes relevant to the project.
Also, I'm not sure how you are getting the data from the third party, but if it's like us, where their inventory system exports about 4 CSV files each night (plus accompanying "long text" .txt files for descriptions in another directory), rsync (on the Linux end) and DeltaCopy (which easily wraps and sets up rsync on the Windows end) is a great tool. It's extremely fast and it has ways to transfer just what has changed, not everything all over again. It blew away establishing an FTP connection.
-Greg
pr0gr4mm3r posted this at 22:23 — 26th May 2009.
He has: 1,502 posts
Joined: Sep 2006
That was my first thought, but the CSV files are not consistent, and the slightest change would cause a big problem.
I ended up going with Matt's solution. The website has three databases. DB1 holds the consistent data, and DB2 and DB3 hold the daily imported data. If we are using DB2 today, then when the import script runs, DB3 is truncated and populated while the website is still serving DB2. Once the import is done, the active database is swapped for the freshly updated one by changing a constant.
Clunky? Maybe, compared to some of the other solutions posted here, but this company is a bit of a penny pincher, and they weren't willing to pay me to come up with something better. And it has worked so far. I support my code free for a year, so the company would get me on the phone in a heartbeat if there were a problem. I had to iron out a couple of minor glitches after the launch, but I haven't heard from them on this project since late last year, so it must be working for them.
Shaggy posted this at 01:55 — 27th May 2009.
They have: 121 posts
Joined: Dec 2008
That also sounds like a good approach. Another option could be two tables, instead of two databases, that you switch between.
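One way to make that table swap atomic in MySQL (my own sketch with invented table names, not something spelled out above) is a multi-table RENAME:
-- Load the fresh feed into listings_new, then swap in one atomic statement;
-- yesterday's data survives as listings_old until you drop it.
RENAME TABLE listings TO listings_old,
             listings_new TO listings;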
Ah, real estate data. It is always interesting what weird and wonderful ways the boards come up with to distribute that stuff eh? Have you had the 'pleasure' of dealing with (board's interpretation of) RETS feeds yet?
Cheers,
Shaggy
pr0gr4mm3r posted this at 02:17 — 27th May 2009.
He has: 1,502 posts
Joined: Sep 2006
At least that is in XML. This is what I had to deal with.
Shaggy posted this at 14:15 — 27th May 2009.
They have: 121 posts
Joined: Dec 2008
Yes, I feel for you. Be happy though, this board actually supplied a data dictionary - many don't even go that far...