Let’s count them words!
Every word has a story and all stories are made of words. This story is about counting words.
Can we count how many unique words there are in Urdu?
Well, Rekhta Dictionary has more than 3.5 lakh of them. That’s an overwhelming number. These words include compound words, idioms and phrases. Also, a lot of them are grammatical or dialect-based variations of one another.
So, maybe we should ask a more practical question: how many unique words have been used in Urdu Poetry? In all the ghazals that we have on Rekhta, there are roughly 65 thousand unique words used by all the poets. Some of them are used quite frequently and some are seldom used.
We asked the computer to take all the ghazals and count how many times each word has been used in this corpus. It only took a few seconds to do that. Then we asked which is the most commonly occurring word in Urdu Poetry, hoping to gain some groundbreaking insight from the answer. The computer replied and the answer was… totally anticlimactic! Can you guess what it is? Well, it is ‘hai / है / ہے’. Not quite profound, eh? We were not exactly interested in words like hai. So let us ignore them.
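For the curious, the counting step can be sketched in a few lines of Python. This is only a minimal illustration, not the code we actually ran: the function name `count_words` and the toy lines are made up for the example, and it assumes that splitting on whitespace is good enough to separate words.

```python
from collections import Counter

def count_words(ghazals):
    """Count how often each whitespace-separated word occurs across all texts."""
    counts = Counter()
    for text in ghazals:
        # Assumption: words are separated by whitespace; no further normalisation.
        counts.update(text.split())
    return counts

# Toy transliterated lines standing in for the real corpus (made up for illustration).
sample = ["dil hai Gam hai", "dil hai"]
counts = count_words(sample)

# The single most frequent word and its count.
top_word, top_count = counts.most_common(1)[0]
print(top_word, top_count)  # prints: hai 3
```

The number of unique words is then simply `len(counts)`.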
Words that are used this often in a language are typically filtered out in most Natural Language Processing contexts, and they even have a name: ‘stop-words’. So we set out to find a list of stop-words for Urdu on the internet, but could not find a decent one. No worries, we can make our own. Sorting the list of words by occurrence and manually picking out the stop-words, we made this list, which we have published just in case any Urdu NLP researcher finds it useful. It mainly consists of common auxiliary verbs, pronouns and prepositions. After this tangent, let us go back to our original question.
Which word, except for these stop-words, is the most commonly used word in Urdu Poetry? Try to guess once again please.
The answer: dil / दिल / دِل
Makes sense, right? Now we are getting somewhere! Which is the second most used word? It is Gam / ग़म / غم. Right on! Confirms the stereotype! Looks like we are onto something.
Here are the 50 most common words ignoring stop words and verbs in ghazals: dil, Gam, aa.nkh, nazar, baat, zindagii, ishq, duniyaa, mohabbat, yaad, raat, din, KHudaa, KHvaab, shab, shahr, dard, rang, log, vaqt, dar, gul, husn, naam, safar, havaa, haath, raah, vafaa, shaam, jaan, yaar, KHaak, umr, kaam, phuul, dam, haal, KHabar, duur, roz, may, KHayaal, manzil, dariyaa, shauq, suurat, lab, bahaar, zaKHm.
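The filtering behind a list like this can be sketched as follows. Again, this is a hedged illustration rather than our actual script: `top_words` is a hypothetical helper, the counts are invented, and the exclusion set here is a tiny stand-in for a hand-picked list of stop-words and verbs.

```python
from collections import Counter

def top_words(counts, excluded, n=50):
    """Return the n most frequent words that are not in the excluded set."""
    kept = Counter({word: c for word, c in counts.items() if word not in excluded})
    return [word for word, _ in kept.most_common(n)]

# Invented counts and a tiny hand-picked exclusion list, for illustration only.
counts = Counter({"hai": 9, "me.n": 7, "dil": 5, "Gam": 4, "nazar": 2})
excluded = {"hai", "me.n"}

print(top_words(counts, excluded, n=3))  # prints: ['dil', 'Gam', 'nazar']
```

Keeping the exclusion list as a plain set makes it easy to grow by hand: sort the words by count, scan the top of the list, and add whatever does not carry meaning on its own.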