Okapi, or “What if ripgrep Could Edit?”

Image by Anne Karakash from Pixabay

I needed to fix scannos (OCR errors) in tens of thousands of line-based text files, so I built a tool called Okapi on top of ripgrep that lets me find them in context and fix them in bulk using my text editor. Install it with Homebrew.

The project is digitizing tens of thousands of pages of US Government employee data. The source is called the Official Register, and there are over 100 volumes spanning 150 years. I’ve had great success with olmOCR, which is far more accurate than vanilla Tesseract, but it still generates many, many scannos.

Double-U Double-U III

One really common character sequence produced by the OCR, and one that is almost never right, is III. In a few rare cases, that really does mean that the person has the same name as his father…
