If you ask Saheed Azeez how difficult it was to create
Naijaweb, a dataset of 230 million GPT2 tokens based on Nairaland,
he'll tell you it was easy. "All you need is to know web scraping and data
cleaning," he said.
However, when he explained how he created Naijaweb,
most of what he said flew over my head. For a final-year Mechanical Engineering
student at the University of Lagos, Nigeria, Naijaweb is an impressive feat,
even if Azeez doesn't completely agree.
Naijaweb is not another ChatGPT or a version of Ijemma
Onwuzulike's IgboSpeech. It is a dataset that can be used to train a large
language model (LLM), like the one that powers ChatGPT.
Feeding data into LLMs sounds like a walk in the park,
but every step, from scraping the pages to converting the text into GPT
tokens, takes web scraping and data-cleaning skills.
Azeez started learning the web scraping and data-cleaning
skills he used to create Naijaweb in a Python class in 2019.
The only reason he joined the class was the
mention of machine learning, which he thought meant teaching robots or machines
how to learn. As a mechanical engineering student, he thought it could come in
handy, but he soon realised that there were no physical machines involved in
machine learning.
Rather than be disappointed, he got even more
interested; thanks to the COVID-19 pandemic in 2020, he had time to take it
seriously.
He started with a lot of machine learning competitions
he found on Zindi.
He lost most of them. But while they were important for learning, they weren't
enough to help him build Naijaweb. "I needed to learn how to build from
scratch."
Building Naijaweb
In 2022, Azeez began his first attempt at web scraping
Nairaland. "I heard people talking a lot about the amount of value
Nairaland possesses, so I decided to try web scraping it."
Unfortunately, this first attempt didn't go very well.
"The script I used back then didn't support synchronous programming."
Synchronous programming means tasks are completed one
after another in a set order, with each step waiting for the previous one to
finish. When he tried again this year, he figured it out, and he gives credit
to Hugging Face, an open-source platform for ML and data science that created
an easy-to-use library.
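To make the idea of synchronous execution concrete, here is a minimal Python
sketch. It is not Azeez's script, which the article does not show. The page
names, the delay, and every function name are hypothetical, and the
"downloads" are simulated with sleeps so that no real website is contacted.
The asynchronous version is included only for contrast: it starts all the
waits at once instead of queuing them one behind another.

```python
import asyncio
import time

# Hypothetical page identifiers; no real network requests are made.
PAGES = ["page-1", "page-2", "page-3"]
DELAY = 0.1  # simulated seconds spent waiting for each "download"


def fetch_sync(page):
    """Pretend to download one page, blocking until it is done."""
    time.sleep(DELAY)
    return f"content of {page}"


def scrape_synchronously(pages):
    """Fetch pages one after another; each waits for the previous one."""
    return [fetch_sync(page) for page in pages]


async def fetch_async(page):
    """Pretend to download one page without blocking other downloads."""
    await asyncio.sleep(DELAY)
    return f"content of {page}"


async def _gather_pages(pages):
    # gather() runs the coroutines concurrently and keeps the input order.
    return await asyncio.gather(*(fetch_async(page) for page in pages))


def scrape_asynchronously(pages):
    """Fetch all pages concurrently and return their contents in order."""
    return list(asyncio.run(_gather_pages(pages)))


if __name__ == "__main__":
    for scrape in (scrape_synchronously, scrape_asynchronously):
        started = time.perf_counter()
        scrape(PAGES)
        print(scrape.__name__, round(time.perf_counter() - started, 2), "s")
```

Both functions return the same contents in the same order. The difference is
in the waiting: the synchronous version waits roughly three delays in a row,
while the asynchronous one overlaps them.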
The next step now is for Naijaweb to be used to train an
LLM, but that might not happen.
While 230 million GPT2 tokens seems like a lot, in
today's AI age, it is not nearly enough. But what exactly are these tokens?
LLMs understand numbers, not words. The process of
converting words into numbers that LLMs understand is called tokenisation.
"If we were to tokenise the word CALCULATED, for
example, we could split it into four tokens, CAL-CU-LA-TED. A number will be
assigned to each of these tokens."
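Azeez's example can be sketched in a few lines of Python. This is a toy
illustration only: the four-entry vocabulary and the function names are
hypothetical, and the numbers assigned to each piece are arbitrary. GPT-2's
real tokeniser uses byte-pair encoding with a much larger learned vocabulary,
so it would not necessarily split the word this way.

```python
# Hypothetical toy vocabulary: each known piece of text maps to a number.
VOCAB = {"CAL": 0, "CU": 1, "LA": 2, "TED": 3}


def tokenise(word, vocab):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining slice first, then shorter ones.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece in the vocabulary matches at this position.
            raise ValueError(f"cannot tokenise {word[start:]!r}")
    return pieces


def encode(word, vocab):
    """Convert a word into the list of numbers a model would receive."""
    return [vocab[piece] for piece in tokenise(word, vocab)]


if __name__ == "__main__":
    print(tokenise("CALCULATED", VOCAB))  # ['CAL', 'CU', 'LA', 'TED']
    print(encode("CALCULATED", VOCAB))    # [0, 1, 2, 3]
```

A dataset's size in tokens is simply the total count of such pieces across
all of its text, which is how a figure like 230 million tokens is reached.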
Naijaweb is not the first complex project Azeez has taken
on. In 2022, he built Tweet Shot, a screenshot bot on X. According
to him, it was his most viral creation, with 170,000 followers.
Azeez said the bot has been acquired by "an
Indian man", although he declined to say for how much.
What is next?
Azeez currently works as a Machine Learning Engineer
with HelpMum, a non-profit AI startup dedicated to building solutions that
support maternal and infant healthcare. Between school and his job, he barely
has enough time to do the required research to take his AI skills to the next
level.
However, these are not the only things that stand in
his way. Building AI projects requires a lot of computing power and constant
electricity, neither of which he has. Creating the dataset, for example,
required him to keep his laptop running for days.
Using the dataset he has created to train LLMs would
require a very powerful graphics processing unit (GPU). While he could use a
service like Google Colab to get access to high-end GPUs, he'd still need a
good laptop and constant electricity for weeks.
But when it comes to building LLMs, Azeez says that is
not the job of one man. It requires a team of highly skilled machine learning
engineers, some of whom he says Nigeria has.
"There are Nigerians that are very skilled in
these things who have gone to do their PhD abroad. I know a UNILAG graduate who
built a small LLM one time."
Azeez even revealed that there is a thriving community
of AI enthusiasts in Nigerian universities, a group of super-smart people who
are passionate about AI and ML. Data Science Nigeria is doing a good job of
feeding their passions, he said, but with an intermittent power supply and the
unavailability of GPUs, will these passions amount to anything?