2y ago

Dealing with Databases - Inserting, Updating Etc.

Wondering if anyone here has some advise or a good place to learn about dealing with databases with Python. I know SQL fairly well for pulling data and simple updates, but running into potential performance issues the way I've been doing it. Here are 2 examples.

Dealing with Pandas dataframes. I'm doing some reconciliation between a couple of different datasources. I do not have a primary key to work with. I have some very specific matching criteria to determine a match. The matching process is all built within Python. Is there a good way to do the database commits with updates/inserts en masse vs. line by line? I've looked into upsert (or inserts with clause to update with existing data), but pretty much all examples I've seen rely on primary keys (which I don't have since the data has 4 columns I'm matching on).
Dealing with JSON files which have multiple layers of related data. My database is built in such a way that I have a table for header information, line level detail, then third level with specific references assigned to the line level detail. As with a lot of transactional type databases there can be multiple references per line, multiple lines per header. I'm currently looping through the JSON file starting with the header information to create the primary key, then going to the line level detail to create a primary key for the line, but also include the foreign key for the header and also with the reference data. Before inserting I'm doing a lookup to see if the data already exists and then updating if it does or inserting a new record if it doesn't. This works fine, but is slow taking several seconds for maybe 100 inserts in total. While not a big deal since it's for a low volume of sales. I'd rather learn best practice and do this properly with commits/transactions vs inserting an updating each record individually within the ability to rollback should an error occur.

9 comments

I typically like using sqlalchemy's ORM for my database operations.
For something simple, using it with sqlite3 can be more efficient than parsing through a JSON file. Combine this with a primary key to help with the double-insertion problem (so to not have to iterate through before inserting) and it can work out quite well.
I've never really used Pandas dataframes though.
Another fun option (if willing to not use a database, but rather a disk-cache) is https://github.com/grantjenks/python-diskcache. Behind the scenes it actually also uses a sqlite3 db.
- What function or class from that library would I use to do this? Or what key words can I use to search and learn more? I'm struggling wrapping my head around it.
  
  Here is a basic tutorial for how sqlalchemy works. If you already have a database in place you'll have to port your schema to it, but it might be worth trying out and seeing if it's more performat.

Databases are more efficient with bulk queries.
Rather than query each entry individually batch your data and query for the existence of that batch (e.g. where key in (1,2,3,etc)). You could do this one out json document, once per 100 entries, or however it makes sense. You can then check the results for your key to determine whether to insert or update. Then commit on that batch set.
- Do you happen to have any examples? I'm just not sure how to convert the JSON example into a bulk query since I need to keep the reference and line detail associated to the header. There is no primary key across all 3 sections. It's generated when I insert into the database.
  
  It's a little hard to say without seeing your datastructure. Is this something like
  undefined
  
  { header: { id: 1, items: [ { name: "foo", field2: "bar" } ] }
  If you have something unique in the "header" you can create 2 tables with a dependency.
  undefined
  
  create table header ( id number ); create table item ( id number, header_id number, name varchar, field2 varchar);
  You can generate IDs for each item on-the-fly but won't be able to tie the back to the JSON. BUT if you can tie back header to the JSON then you can do a "drop-and-replace" on the items with each run. Which may not be the most efficient but it will likely perform better than querying each row upon entry. e.g. (pseudocode)
  undefined
  
  for each header in headers { delete from item where item.parent_id = header.id; for each item in header.items { insert into item values ( some_id, header.id, item.name, item.field2 ); } commit; }
  But if you don't want to drop/re-create then if there' some combination of things in the "item" that is unique then you can use that as a compound key. In the worst case you can just use all the columns. I once created a primary key that was an MD5 checksum of the string value of all the fields in the row. It gave me a calculable primary key which was good and I could query off it easily. But it does make expanding the table much harder...
  The advantage of drop-and-replace will be that removed items in the JSON will also be removed in the database. Otherwise you'll need to do some additional cleanup to find database entries that don't have an entry in your JSON file(s).

9 comments