I love PETL

When I started at my current job, I noticed we had lots of room for improvement in how we imported and exported data. Folks had been using the Microsoft SSIS platform to Extract, Transform and Load data between our database and various files.

SSIS is great for lots of things and has a lot of upsides. It is very drag and drop, folks don’t have to know a lot of programming to get it to do things, and it has lots of functions built in. If you need more programming power, you can execute C# or VB scripts to do the fiddly bits.

But I hate it. (Don’t worry, we’ll get to the love soon.)

My biggest problems with SSIS:

  • It is unversionable. Try reading a git diff of an SSIS change. The XML is designed for a machine to read, not a human. If you want to know what has changed over time in your world, it’s a problem.
  • You can only use Visual Studio to edit it. Many of our SSIS packages include VB or C# scripts. That sounds fine, but apparently these compile to an undiffable, uneditable blob in the XML that is only recompiled if you save using Visual Studio. So if you want to change something across many SSIS package scripts, you have to open and resave each one.
  • It hides options under rocks. Finding out how something works requires lots of delving into lotsa windows and dialogues.
  • It changes things unexpectedly. Click in the wrong dialogue and it helpfully re-infers datatypes from a file for you. You don’t know until you go to execute.
  • It slapped my momma. Etc.

I wanted to move my team to something that was better for people.

We need something:

  • That we can diff.
  • That we can do code reviews and pull requests on.
  • That is simple, expressive and clear.
  • That is powerful.

To me that sounds like a programming language. I encouraged folks on the team to try Python for a couple of tasks that would otherwise have gone into an SSIS package. Immediately, things got better. Our code reviews made sense. Code quality improved with every single pull request.

We used pymssql to connect to SQL Server and inserted records as needed after processing them. Navigating and transforming XML docs was easy, and CSV files were eaten up by the standard library’s csv.DictReader.
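
To give a feel for that pattern, here is a minimal sketch. It is not our actual code: the column names, the payments table and the insert_records helper are all made up for illustration. The insert step is written against a generic DB-API connection with %s placeholders, which is the style pymssql uses.

```python
import csv
import io


def read_records(handle):
    """Yield (name, amount) tuples from CSV text that has a header row."""
    for row in csv.DictReader(handle):
        # Light cleanup before loading: trim whitespace, convert types.
        yield (row["name"].strip(), float(row["amount"]))


def insert_records(conn, records):
    """Insert records through a DB-API connection that uses %s placeholders."""
    cursor = conn.cursor()
    cursor.executemany(
        # Hypothetical table and columns, for illustration only.
        "INSERT INTO payments (name, amount) VALUES (%s, %s)",
        list(records),
    )
    conn.commit()


# A tiny in-memory CSV stands in for a real file.
sample = io.StringIO("name,amount\n Ada ,10.50\nGrace,3\n")
records = list(read_records(sample))
print(records)

# With a real server, the connection would come from something like:
#   import pymssql
#   conn = pymssql.connect(server="...", user="...", password="...", database="...")
#   insert_records(conn, records)
```

The nice part is that every step is ordinary code, so a change to the cleanup logic shows up as a readable one-line diff in a pull request.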

And then Derrick found PETL. It’s beautiful. You point it at data and make simple moves to completely transform it. I’m smitten.

I had dozens of files to read from, each a quarterly file for a year, with the year and quarter noted only in the file name. Each had a crappy heading line that preceded the column headers. I needed to put them into one file for loading into Salesforce Wave. Whacking together a solution with PETL was effortless. Line 36 is where the PETL starts, and it’s so small and good that it is nice to see how much it encapsulates.

But wait, there's more
