
Aibek Makazhanov
Nazarbayev University | NU · National Laboratory Astana
MSc
About
26
Publications
11,113
Reads
332
Citations
Introduction
I am currently working on computational models of the morphology of Kazakh, an agglutinative Turkic language. I am also interested in sentiment analysis and web mining.
Additional affiliations
Education
September 2010 - November 2012
September 2003 - July 2008
Publications (26)
We present an approach to the extraction of family relations from literary
narrative, which incorporates a technique for utterance attribution proposed
recently by Elson and McKeown (2010). In our work this technique is used in
combination with the detection of vocatives - the explicit forms of address
used by the characters in a novel. We take adv...
We present LemMED, a character-level encoder-decoder for contextual morphological analysis (combined lemmatization and tagging). LemMED extends and is named after two other attention-based models, namely Lematus, a contextual lemmatizer, and MED, a morphological (re)inflection model. Our approach does not require training separate lemmatization and...
We present the current results of our ongoing work on developing tools and algorithms for processing the Kazakh language in the framework of the KazNLP project. The project is motivated by the need for accessible, easy-to-use, cross-platform, and well-documented automated text processing tools for Kazakh, particularly user-generated text, which includes tr...
We compare manual and automatic approaches to the problem of extracting bitexts from the Web in the framework of a case study on building a Russian-Kazakh parallel corpus. Our findings suggest that targeted, site-specific crawling results in cleaner bitexts with a higher ratio of parallel sentences. We also find that general crawlers combined with...
In this work we compare a number of approaches to machine translation (MT) from Russian to Kazakh. We focus specifically on this pair of languages for a number of reasons. First, these languages are relatively understudied in terms of MT research, as well as natural language processing (NLP) research in general. Kazakh, in particular, has been act...
Annotated corpora of three Turkic languages – Turkish, Kazakh, and Uyghur – were released as part of version 2 of the Free/Open-Source Universal Dependencies (UD) syntactic and morphological annotation guidelines. The objective of these guidelines is to provide consistent dependency annotation to facilitate cross-linguistic comparison.
This paper p...
In this work we address the problems of sentence segmentation and tokenization. Informally, the task of sentence segmentation involves splitting a given text into units that satisfy a certain definition (or a number of definitions) of a sentence. Similarly, tokenization has as its goal splitting a text into chunks that for a certain task constitute...
In this exploratory study we attempt to draw parallels between the so-called dimensions of Big Data (i.e. Volume, Veracity, Variety, Velocity, or 4V-s) and various aspects of machine translation. For this purpose we set up a number of experiments that, in our opinion, help to relate machine translation to Big Data. In addition to the experiments, which a...
We develop a language-independent, deep learning-based approach to the task of morphological disambiguation. Guided by the intuition that the correct analysis should be “most similar” to the context, we propose dense representations for morphological analyses and surface context and a simple yet effective way of combining the two to perform disambi...
We present the initial results of our experiments on document alignment for the online news domain. Specifically, as opposed to cross-site comparable news alignment, we focus on the identification of parallel documents from within the same multilingual websites. In such a setting parallel news stories oftentimes turn out to be direct translations o...
In this paper we describe work in progress on designing continuous vector space word representations that can map unseen data adequately. We propose an LSTM-based feature extraction layer that reads in a sequence of characters corresponding to a word and outputs a single fixed-length real-valued vector. We then test our model on a POS tagging ta...
We present our initial experiments on Russian to Kazakh phrase-based statistical machine translation. Following a common approach to SMT between morphologically rich languages, we employ morphological processing techniques. Namely, for our initial experiments, we perform source-side lemmatization. Given a rather humble-sized parallel corpus at hand...
Sentence alignment is the final step in building parallel corpora, which arguably has the greatest impact on the quality of a resulting corpus and the accuracy of machine translation systems that use it for training. However, the quality of sentence alignment itself depends on a number of factors. In this paper we investigate the impact of several...
Morphological annotation with ambiguity resolution is a process of assigning each token (annotation unit) a single appropriate morphological parse (a triple consisting of <lemma, part of speech tag, a set of grammemes>) in accordance with a predefined annotation scheme. Tokenization criteria constitute an inseparable part of an annotation scheme, b...
The present work is a report on the authors’ first attempt to use the universal dependencies (UD) (de Marneffe et al., 2014) standard for syntactic annotation of Kazakh. The report is a result of a manual annotation of 300 sentences randomly chosen from the Kazakh Language Corpus (Makhambetov et al., 2013). We focus primarily on providing an exten...
The present work is a report on the authors' first attempt to use the universal dependencies (UD) (de Marneffe et al., 2014) standard for syntactic annotation of Kazakh. The report is a result of a manual annotation of 300 sentences randomly chosen from the Kazakh Language Corpus (Makhambetov et al., 2013). We focus primarily on providing an extens...
We propose a method for morphological analysis and disambiguation for the Kazakh language that accounts for both inflectional and derivational morphology, including not fully productive derivation. The method is data-driven and does not require manually generated rules. We leverage so-called “transition chains” that help prune false segmentations, wh...
We propose a method for complete morphological analysis of the Kazakh language that accounts for both inflectional and derivational morphology. Our method is data-driven and does not require manually generated rules, which makes it convenient for analyzing agglutinative languages. The intuition behind our approach is to label morphemes with so-called t...
We compare and discuss various approaches to the problem of part-of-speech (POS) tagging of texts written in Kazakh, an agglutinative and highly inflectional Turkic language. In Kazakh, due to productive morphology, a single root may produce hundreds of word forms, and it is difficult to label enough training data to account for a majority of word fo...
We study the problem of predicting the political preference of users on the Twitter network, showing that the political preference of users can be predicted from their Twitter behavior towards political parties. We show this by building prediction models based on a variety of contextual and behavioral features, training the models by resorting to a...
Being an agglutinative language, Kazakh imposes certain difficulties on both recognition of correct words and generation of candidate corrections for misspelled words. In this paper we describe a spelling correction method for Kazakh that takes advantage of both morphological analysis and a noisy channel-based model. Our method outperforms both open s...
This paper presents the Kazakh Language Corpus (KLC), which is one of the first attempts made within a local research community to assemble a Kazakh corpus. KLC is designed to be a large-scale corpus containing over 135 million words and covering five stylistic genres: literary, publicistic, official, scientific and informal. Along with its primar...
We study the problem of predicting the political preference of users on the Twitter network, showing that the political preference of users can be predicted from their interaction with political parties. We show this by building prediction models based on a variety of contextual and behavioural features, training the models by resorting to a distan...
Predicting the positive or negative attitude of individuals towards each other in a social environment has long been of interest, with applications in many domains. We investigate this problem in the context of the collaborative editing of articles in Wikipedia, showing that there is enough information in the edit history of the articles that can b...