Projects

Tools

Pashto Spell Checker and Syllable Analytics پښتو هجايي غلطۍ نيونکی

Development Work: The software on this page is developed jointly by the Center of Computational Linguistics (CoCL), FAST National University of Computer & Emerging Sciences (NUCES), Peshawar, and the Pashto Academy (PA), University of Peshawar. Development is undertaken by Dr. Taimoor Khan and Dr. Omar Usman Khan from CoCL, and data curation by Dr. Nasrullah Wazir from PA. The tools here are in continual development and are made available for the purpose of field testing and community feedback.

About Tool: This is a probabilistic tool that checks for isolated non-word errors in Pashto text. Each word is checked against a corpus [a1] of 1.01 million words (0.125 million unique words), and corrections are suggested from the character shufflings that yield the highest probabilities.

To access the tool and details of how it works, please visit Pashto Spell Checker.
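
The idea described above can be illustrated with a minimal sketch. It is not the tool's actual implementation: it assumes that "character shufflings" means swapping adjacent characters, that probabilities are plain relative frequencies in the corpus, and it uses a toy Latin-script word list in place of the Pashto corpus. The function names train, shufflings, is_known, and suggest are hypothetical.

```python
from collections import Counter


def train(words):
    """Count how often each word occurs in the corpus."""
    return Counter(words)


def shufflings(word):
    """Assumed candidate set: every string made by swapping two adjacent characters."""
    return {
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
    }


def is_known(word, counts):
    """A word that never occurs in the corpus is treated as a non-word error."""
    return counts[word] > 0


def suggest(word, counts, top_n=3):
    """Rank the known shufflings of a non-word by relative corpus frequency."""
    total = sum(counts.values())
    candidates = [c for c in shufflings(word) if c in counts]
    candidates.sort(key=lambda c: counts[c] / total, reverse=True)
    return candidates[:top_n]


# Toy corpus standing in for the real 1.01-million-word corpus.
counts = train(["abc", "abc", "abc", "bca", "cat"])
print(is_known("bac", counts))  # False
print(suggest("bac", counts))   # ['abc', 'bca']
```

A checker built this way only flags words that are absent from the corpus, so real-word errors (a valid word used in the wrong place) pass unnoticed, which matches the "isolated non-word errors" scope stated above.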

Pashto Thesaurus

The Pashto thesaurus is built using advanced machine learning approaches to assist Pashto speakers. The tool is developed as a Google Chrome extension that lets users select and search words directly from Pashto web pages. The trained model resides on the server and responds to each query with similar words. MS(DS) student Samreen Syed (19P-0301) worked on this problem in her MS thesis.

Pashto Thesaurus at Google Chrome Web Store
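
How a server-side model might answer a query with similar words can be sketched as follows. This is an illustration under stated assumptions, not the thesaurus's actual method: it assumes the model represents each word as a numeric vector and ranks neighbours by cosine similarity. The toy English vectors and the names cosine and similar_words are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_words(query, vectors, k=3):
    """Return the k words whose vectors are closest to the query word's vector."""
    if query not in vectors:
        return []
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:k]]


# Hand-made 2-D vectors; a trained model would supply these (assumption).
vectors = {
    "big": [1.0, 0.1],
    "large": [0.9, 0.2],
    "small": [-1.0, 0.1],
    "river": [0.0, 1.0],
}
print(similar_words("big", vectors, k=2))  # ['large', 'river']
```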

Urdu Poetry Classification

The project was carried out by BS (CS) students Ameer Hamza (16P-6083), Manzur Ahmad Mitha Khan (16P-6062), and Raj Bakhtawar (16P-6051). A machine learning model is trained to classify Urdu poetry into four genres: sad, patriotic, religious, and comic.

The code for the Urdu poetry classification model, with a front end in Flutter, is accessible here.

Semantic Search Engine for Legal Cases

The aim is to develop a legal information retrieval system built on a state-of-the-art deep learning model that searches by extracting the intent from the query and performs well with longer queries. The model used for this work comes from the transformers library and is fine-tuned on classification tasks. It is then used to compute the semantic similarity between the query and the legal corpora. BS (CS) students Khadija Hayat (17P-6084) and Malik Bilal Rahim (17P-6095) carried out this project as part of their degree.

The code and dataset can be accessed at Code.

Here is a brief overview of the project in a short video.

Datasets

CVPR 2019 Papers

The dataset was developed as part of the project "Classification of Research Papers Through Language and Style Features" conducted by Aaqib Pervaiz, MS(CS) 16P-7002. The research project aimed to develop a tool that helps reviewers deal with the large volume of papers submitted to conferences. It provides an initial filtering mechanism based on formatting and style features.

TF-IDF based content features
Meta style features
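
For readers unfamiliar with TF-IDF, the content features listed above can be illustrated with a minimal sketch. It uses one common textbook weighting, term frequency times log(N / document frequency); the exact variant used to build the dataset is not stated here, so this choice is an assumption. The function name tf_idf and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute a {term: weight} mapping for each tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    features = []
    for doc in docs:
        term_counts = Counter(doc)
        features.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return features


# Two toy "papers", already split into tokens.
docs = [["deep", "learning", "vision"], ["deep", "style"]]
for weights in tf_idf(docs):
    print(weights)
```

A term that appears in every document, such as "deep" in this toy input, receives a weight of zero, so the features emphasise terms that distinguish one paper from another.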

Urdu Poetry Classification Dataset

The dataset was developed by BS (CS) students Ameer Hamza (16P-6083), Manzur Ahmad Mitha Khan (16P-6062), and Raj Bakhtawar (16P-6051) as part of their final year project. It has more than 4,000 poetic lines labeled as sad, comic, religious, or patriotic.

Urdu poetry classification dataset