When building data-intensive applications, a significant portion of your time will be dedicated to data wrangling (cleaning, de-duplicating, removing markup, etc.). Data Prep Kit (https://github.com/IBM/data-prep-kit) is a new open source project that helps you with this.
Data Prep Kit (DPK) is an open source Python library that can scale from your laptop to a highly scalable cluster in the cloud. It has been used at scale to prepare terabytes of data to train the IBM Granite Large Language Models (LLMs).
A few noteworthy features of DPK include: de-duplicating documents (both exact and fuzzy dedupe), handling documents and code, language detection (spoken languages and programming languages), removing PII, malware detection, and creating embeddings for a vector database.
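To give a feel for what exact dedupe does, here is a minimal, self-contained sketch of the underlying idea: keep only the first occurrence of each document, comparing documents by a hash of their content. This is a conceptual illustration only, not DPK's actual API (DPK packages operations like this as scalable transforms).

```python
import hashlib

def exact_dedupe(documents):
    """Keep the first occurrence of each unique document,
    comparing documents by a SHA-256 hash of their content."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["hello world", "goodbye", "hello world"]
print(exact_dedupe(docs))  # → ['hello world', 'goodbye']
```

Fuzzy dedupe generalizes this idea, catching near-duplicates (e.g. documents that differ only in whitespace or boilerplate) that a byte-exact hash would miss.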
In this talk, I will go over some of the most interesting features of the Data Prep Kit and, time permitting, show a demo.