AgainstInequality is a web app designed to unite as much corporate and political data as possible, with the goal of determining who owns the largest American corporations and what political candidates they donate to.
At the time I started this, I couldn't find a single free corporate ownership database. There are excellent general information sites, like OpenCorporates. So I made a set of scripts that download all corporate ownership-related forms from the SEC and process them into a summary table that can easily be searched: SEC DataGrinder.
It also downloads and processes the excellent OpenSecrets data, which was very well organized into a database and no problem at all to work with.
This... was a tremendous amount of work. The reason is that the forms you can download from the SEC are absurdly messy. They are provided exactly as they were submitted, so they could be .html files, .pdfs, or even plain text files. There could be 3 MB of extra information attached to them.
Here are two utterly different examples from Merrill Lynch, the same company:
This required a tremendous number of heuristics to figure out where the all-important numbers are: number of shares, CIK (Central Index Key, the SEC's internal unique ID) and percent of ownership.
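To give a feel for what one such heuristic can look like, here is a minimal Python sketch. It is not the actual SEC DataGrinder code. It assumes the filing uses labels like those on Schedule 13D/13G cover pages ("AGGREGATE AMOUNT BENEFICIALLY OWNED", "PERCENT OF CLASS") and an SEC-style "CENTRAL INDEX KEY:" header line. The function names and regular expressions are hypothetical, and real filings vary far more than this handles.

```python
import re
from html import unescape

# Hypothetical patterns for illustration only; each assumes one specific
# label wording and will miss filings that phrase things differently.
TAG_RE = re.compile(r"<[^>]+>")
CIK_RE = re.compile(r"CENTRAL INDEX KEY:\s*(\d{1,10})", re.IGNORECASE)
SHARES_RE = re.compile(
    # First number following the label, with no other digits in between.
    r"AGGREGATE AMOUNT BENEFICIALLY OWNED[^0-9]{0,200}?(\d[\d,]*)",
    re.IGNORECASE,
)
PERCENT_RE = re.compile(
    # First number that is directly followed by a percent sign; this skips
    # row references such as "(9)" inside the label itself.
    r"PERCENT OF CLASS.{0,200}?(\d{1,3}(?:\.\d+)?)\s*%",
    re.IGNORECASE | re.DOTALL,
)


def to_plain_text(raw: str) -> str:
    """Drop HTML tags and entities so HTML and text filings look alike."""
    text = unescape(TAG_RE.sub(" ", raw))
    return re.sub(r"\s+", " ", text)


def extract_ownership(raw: str) -> dict:
    """Return CIK, share count and percent of ownership, or None if not found."""
    text = to_plain_text(raw)
    cik = CIK_RE.search(text)
    shares = SHARES_RE.search(text)
    percent = PERCENT_RE.search(text)
    return {
        "cik": cik.group(1).zfill(10) if cik else None,
        "shares": int(shares.group(1).replace(",", "")) if shares else None,
        "percent": float(percent.group(1)) if percent else None,
    }
```

Calling `extract_ownership(raw_filing_text)` on either an HTML or a plain-text filing returns a dictionary with the keys `cik`, `shares` and `percent`. Any value the patterns cannot locate comes back as `None`, which makes it easy to flag filings that need another rule.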
These days, I'd throw machine learning at it instead, as it's much better suited to making rules based on data, and I have the current database as an excellent training set.