Preparing Data for LLM Applications Using Data Prep Kit

When building data-intensive applications, a significant portion of your time goes to data wrangling (cleaning, de-duplicating, removing markup, etc.). Data Prep Kit (https://github.com/IBM/data-prep-kit) is a new open source project that helps with this work.

Data Prep Kit (DPK) is an open source Python library that scales from your laptop to a large cluster in the cloud. It has been used at scale to prepare terabytes of data to train the IBM Granite Large Language Models (LLMs).

A few noteworthy features of DPK include: de-duping documents (exact dedupe and fuzzy dedupe), handling documents and code, language detection (spoken languages and programming languages), removing PII, malware detection and creating embeddings for a vector database.
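
To make the difference between exact and fuzzy de-duplication concrete, here is a minimal sketch in plain Python. It does not use the DPK API: the function names (`exact_dedupe`, `fuzzy_dedupe`), the 3-word shingle size, and the 0.6 similarity threshold are all assumptions chosen for illustration. Exact dedupe compares content hashes after light normalization, while fuzzy dedupe drops documents whose word-shingle Jaccard similarity to an already kept document reaches the threshold.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def exact_dedupe(docs):
    """Keep the first document for each distinct hash of the normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def shingles(text: str, k: int = 3):
    """Return the set of k-word shingles (hypothetical default k=3)."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two documents' shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def fuzzy_dedupe(docs, threshold: float = 0.6):
    """Keep a document only if it is not too similar to any kept document.

    This pairwise comparison is quadratic and only suitable for small inputs.
    """
    kept = []
    for doc in docs:
        if all(jaccard(doc, other) < threshold for other in kept):
            kept.append(doc)
    return kept
```

The pairwise comparison above is only practical for small collections. Large-scale pipelines commonly rely on techniques such as MinHash with locality-sensitive hashing to avoid comparing every pair of documents.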

In this talk, I will go over some interesting features of the Data Prep Kit. If time permits, I will show a demo.

AI Tracks

Data

Featured Speaker

AI Engineer | IBM AI Alliance

Sujee Maniyam is a seasoned practitioner focusing on Generative AI, Machine Learning, Deep Learning, Big Data, Distributed Systems, and Cloud technologies. He also loves teaching and has taught and mentored thousands of professionals.

Event Details

Time & Duration:

11:45 am
90 – 120 min

In-Person Location:

Lovelace Room
