MLDatasetBuilder: A Python Package for Building Your Custom Dataset

Karthick Nagarajan


As an ML newbie, I needed to figure out the best way to prepare a dataset for training a machine learning model. Following up on my last article, I came up with a Python package for this process!

Whenever you train a custom model, the most important thing is the dataset. Yes, of course, the dataset plays the main role in deep learning: the accuracy of your model depends on it. So, before you train a custom model, you need to plan how to build the dataset. Here, I’m going to share my ideas on an easy way to build yours.

MLDatasetBuilder-Version 1.0.0

A Python package to build datasets for machine learning

Whenever we begin a machine learning project, the first thing we need is a dataset. The dataset is the pillar of the training model. You can build the dataset either automatically or manually. MLDatasetBuilder is a Python package that helps prepare the images for your ML dataset.

GitHub repo: karthick965938/ML-Dataset-Builder


We can install the MLDatasetBuilder package using the command below:

pip install MLDatasetBuilder
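
If you prefer to confirm the install from a script rather than the interactive prompt, here is a minimal sketch using only the standard library. The helper name is_installed is my own and is not part of MLDatasetBuilder; the only thing taken from the package is its import name, which is the one used in the import statement in this article.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


# "MLDatasetBuilder" is the import name used in this article.
print("MLDatasetBuilder installed:", is_installed("MLDatasetBuilder"))
```

This only checks that the module can be located; it does not import it or run any of its code.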

How to test?

When you run python3 in the terminal, it will produce output like this:

Python 3.6.9 (default, Apr 18 2020, 01:56:04) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

Run the following code to see the initialization output for the MLDatasetBuilder package.

>>> from MLDatasetBuilder import *
>>> MLDatasetBuilder()
Output For Initialize Process

Available Operations

PrepareImage — Remove unwanted format images and Rename your images

#PrepareImage(folder_name, image_name)
PrepareImage('images', 'dog')

ExtractImages — Extract images from video file

#ExtractImages(video_path, file_name, frame_size)
ExtractImages('video.mp4', 'frame', 10)
#ExtractImages(video_path, filename)
ExtractImages('video.mp4', 'frame')
#frame_size is optional; the default value is 5 (FPS)

Step 1: Get images from Google

Yes, we can get images from Google. Using the Download All Images browser extension, we can easily collect images in a few minutes. Check out the extension’s page for more details!


Step 2: Create a Python file

Once you have downloaded the images using this extension, you can create a Python file in the same directory as the images folder, as shown below.

| _14e839ba-9691-11ea-a968-2ed746e9a968.jpg
| 5e5f7af12600004018b602c0.jpeg
| A471529_Alice_b-1.jpg
| image1.png
| image2.png
| ...

Inside the images folder, you can see lots of images in different formats (JPG, JPEG, PNG) with random filenames.
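
Before cleaning anything up, it can help to see which file types the extension actually downloaded. The following is a small sketch using only the standard library; count_extensions is a hypothetical helper of mine and not part of MLDatasetBuilder.

```python
from collections import Counter
from pathlib import Path


def count_extensions(folder):
    """Count the files in `folder` by lowercase extension (subfolders are ignored)."""
    return Counter(
        path.suffix.lower() for path in Path(folder).iterdir() if path.is_file()
    )


# Example call, assuming your downloads live in a folder named "images":
# print(count_extensions("images"))
```

Extensions are lowercased so that files such as photo.JPG and photo.jpg are counted together.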

Step 3: PrepareImage

MLDatasetBuilder provides a method called PrepareImage. Using this method, we can remove the unwanted images and rename the image files that you have already downloaded with the browser extension.

PrepareImage(folder_path, class_name)
#PrepareImage('download_image_folder', 'dog')

As per the above code, we need to mention the image folder path and class name.

Output for PrepareImage Option

After completing the process your image folder structure will look like below 

| dog_0.jpg
| dog_1.jpg
| dog_2.jpg
| dog_3.png
| dog_4.png
| ...

This process helps a lot when you annotate your images during labelling. And of course, it gives your files a standardized naming scheme.
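
To make the clean-and-rename idea concrete, here is a minimal sketch of the same kind of step using only the standard library. It is not MLDatasetBuilder's implementation: the function name prepare_images_sketch and the list of allowed extensions are my own assumptions, and the package may decide what counts as an unwanted file differently.

```python
import uuid
from pathlib import Path

# Assumption: these are the formats worth keeping; the package's own list may differ.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def prepare_images_sketch(folder, class_name):
    """Delete files with other extensions, then rename the rest to <class_name>_<n><ext>."""
    folder = Path(folder)
    kept = []
    # sorted() builds the full list first, so renaming inside the loop is safe.
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix not in ALLOWED_EXTENSIONS:
            path.unlink()  # drop unwanted formats
            continue
        # First pass: move to a unique temporary name so final names cannot collide.
        temp = path.with_name(uuid.uuid4().hex + suffix)
        path.rename(temp)
        kept.append(temp)
    renamed = []
    # Second pass: assign the final, numbered names in the original sorted order.
    for index, temp in enumerate(kept):
        target = folder / f"{class_name}_{index}{temp.suffix}"
        temp.rename(target)
        renamed.append(target.name)
    return renamed


# Example call (this deletes files, so try it on a copy of your folder first):
# prepare_images_sketch("download_image_folder", "dog")
```

The two-pass rename is a design choice: if the folder already contains a file called dog_0.jpg, renaming in a single pass could overwrite it.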

Step 4: ExtractImages

MLDatasetBuilder also provides a method called ExtractImages. Using this method we can extract the images from the video files.


As per the code below, we need to mention the video path, folder name, and framesize. The folder name will be the class name; framesize is optional and defaults to 5.

ExtractImages(video_path, folder_name, framesize)
#ExtractImages('video.mp4', 'frame', 10)
ExtractImages(video_path, folder_name)
#ExtractImages('video.mp4', 'frame')
Output for ExtractImages method

After completing the process your image folder structure will look like below

| dog_0.jpg
| dog_1.jpg
| dog_2.jpg
| dog_3.png
| dog_4.png
| ...
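
The third argument is described in this article both as a frame size and as an FPS value, so its exact meaning inside the package is not clear to me. Assuming it means "frames saved per second of video", the sampling logic could be sketched as below. frame_indices_to_keep is a hypothetical helper and not part of MLDatasetBuilder, and it only computes which frames to keep; reading the video itself would need a library such as OpenCV.

```python
def frame_indices_to_keep(total_frames, source_fps, target_fps=5):
    """Return the indices of the frames to save when sampling at `target_fps`.

    Assumption: `target_fps` is the number of frames kept per second of video.
    """
    if total_frames < 0 or source_fps <= 0 or target_fps <= 0:
        raise ValueError("total_frames must be >= 0 and both FPS values must be > 0")
    # Keep one frame every `step` source frames; never step by less than one frame.
    step = max(1.0, source_fps / target_fps)
    indices = []
    i = 0
    while int(i * step) < total_frames:
        indices.append(int(i * step))
        i += 1
    return indices


# One second of 30 FPS video sampled at the default of 5 frames per second:
print(frame_indices_to_keep(30, 30))  # -> [0, 6, 12, 18, 24]
```

If the requested rate is higher than the video's own frame rate, the sketch simply keeps every frame.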

What is version 2.0.0?

I plan to release version 2.0.0 next month. It will include some additional features.

I mean, this package will provide images of more than 100 objects along with annotation files 🙂


All issues and pull requests are welcome! To run the code locally, first, fork the repository and then run the following commands on your computer:

git clone<your-username>/ML-Dataset-Builder.git
cd ML-Dataset-Builder
# Creating a virtual environment before the next step is recommended
pip3 install -r requirements.txt

When adding code, be sure to write unit tests where necessary.


MLDatasetBuilder was created by Karthick Nagarajan. Feel free to reach out on Twitter, LinkedIn, or through email!
