AHRQ diagnosis codes
In this case study, I needed to pull a series of diagnostic codes from an 11-page definition booklet. There were about 275 of these alphanumeric codes split into two columns, with definitions, so using some kind of cleaning tool was preferred to avoid manual cleaning errors.
The original PDF
The original document was a PDF document of 11 pages that looked liked this:
Cometdocs for conversion
At the time, I used a website called Cometdocs to convert the PDF into an Excel spreadsheet, but that tool no longer exists. Today I’d probably try Tablula or Adobe Acrobat or even ChatGPT.
This is the how the conversion came out:
Still quite a bit of cleanup to do.
OpenRefine challenges
This is how the Spreadsheet looked when imported into OpenRefine:
I needed to accomplish several things:
- Remove the definition columns.
- Remove the leftover rows that still had definition text. I did this by using a regular expression to find rows in the first column that started with the letter “O” followed by a number, as these are our diagnostic codes. I then used the inverse of the filter so I could remove the “other” rows.
- Use Transpose to combine the two columns into a single one and sort it.
The gif below runs through all the steps. Note that in reality, I was more methodical, testing each filter and step to make sure I was not removing data I needed to keep or vice versa.
This is the finished data and record of the steps: