
Yaakov HaCohen-Kerner
- Professor, Ph.D.
- Professor (Associate) at Jerusalem College of Technology
About
109
Publications
85,857
Reads
1,218
Citations
Introduction
Prof. Yaakov HaCohen-Kerner teaches at the CS department of the Jerusalem College of Technology (JCT). Yaakov is a co-author/author of 108 papers. Yaakov's main research domains are text classification, sentiment analysis, author profiling, offensive language detection in social texts, and detection of mental disorders.
Current institution
Additional affiliations
October 2020 - present
July 2015 - present
Publications (109)
In this study, we aim to detect, in social media texts written in Hebrew, girls who are suspected of being anorexic. We constructed a dataset containing 100 blog posts written by females who are probably anorexic, and 100 blog posts written by females who are likely to be non-anorexic. The construction of this dataset was supervised and approved by a...
In this research, we extract time-related expressions from a rabbinic text in a semi-automatic manner. These expressions usually appear next to rabbinic references (name / nickname / acronym / book-name). The first step toward our goal is to find all the expressions near references in the corpus. However, not all of the phrases around the reference...
Author profiling from text documents has become a popular task in natural language applications in recent years. Author profiling is important for various domains such as advertising, marketing, forensics, and security. This survey focuses on profiling age and gender, the two features that are probably the most researched profile attributes. In...
In this paper, we describe our submissions for the HASOC 2021 contest. We tackled subtask 1A that addresses the problem of hate speech and offensive language identification in three languages: English, Hindi, and Marathi. We developed different models using six classical supervised machine learning methods: support vector classifier, binary support...
In this paper, we describe our submissions for the UrduFake 2021 track. We tackled the task entitled "Fake News Detection in the Urdu Language". We developed different models using three classical supervised machine learning methods: Support Vector Classifier, Random Forest, and Logistic Regression. Our machine learning models were applied to vario...
In this paper, we describe our submissions for the PAN at CLEF 2021 contest. We tackled the subtask "Profiling Hate Speech Spreaders on Twitter". We developed different models for the English and Spanish languages, using classic machine learning methods like Support Vector Classifier, Multi-Layer Perceptron, Logistic Regression, Random Forest, Ada-Boost...
Text classification (TC) is the task of automatically assigning documents to a fixed number of categories. TC is an important component in many text applications. Many of these applications perform preprocessing. There are different types of text preprocessing, e.g., conversion of uppercase letters into lowercase letters, HTML tag removal, stopword...
We introduce in this paper a generic approach to combine implicit crowdsourcing and language learning in order to mass-produce language resources (LRs) for any language for which a crowd of language learners can be involved. We present the approach by explaining its core paradigm that consists in pairing specific types of LRs with specific exercise...
In the technological age, the phenomenon of complaint letters published on the Internet is increasing. Therefore, it is important to automatically classify complaint letters according to various criteria, such as company categories. In this research, we investigated the automatic text classification of complaint letters written in Hebrew that were...
Aim/Purpose: Finding and tagging citations in ancient Hebrew religious documents. These documents have no structured citations and no bibliography. Background: We look for common patterns within Hebrew religious texts. Methodology: We developed a method that goes over the texts and extracts sentences containing the names of three famous auth...
Author profiling deals with the identification of various details about the author of the text (e.g., age and gender). In this paper, we describe the participation of our team (hacohenkerner19) in the PAN 2019 shared task on Bots and Gender Profiling in two languages: English and Spanish. Given a Twitter feed, we should determine whether its author...
The increase in online social media provides an excellent platform for companies, on the one hand, to read and learn from customers and improve their marketing policies and on the other hand, to perform various analysis and classification tasks. In this short paper, we present basic statistics of the domains of text classification, sentiment analys...
Hundreds of millions of people worldwide suffer from various mental disorders. Recent studies have shown that using text classification models, some of the disorders can be identified by intelligent analysis of the texts written by these people. Another related domain is the automatic classification of documents according to their authors’ mental...
Text classification (TC) is an important component in many research domains, such as information extraction, information retrieval, and text mining. Therefore, the question of whether and how TC can be generally improved is crucial. In this research, we implemented a method that partially resembles the multiplication method. An example of the mult...
Hundreds of millions of people worldwide suffer from various mental disorders. Recent studies have shown that using text classification models, some of the disorders can be identified by intelligent analysis of the texts written by these people. Another related domain is the automatic classification of documents according to their mental state usi...
This article presents a unique method in text and data mining for finding the era, i.e., mining temporal data, in which an anonymous author was living. Finding this era can assist in the examination of a fake document or extracting the time period in which a writer lived. The study and the experiments concern Hebrew, and in some parts, Aramaic and...
Multiword expressions (MWEs) are known as a “pain in the neck” due to their idiosyncratic behaviour. While some categories of MWEs have been largely studied, verbal MWEs (VMWEs) such as to take a walk, to break one’s heart or to turn off have been relatively rarely modelled. We describe an initiative meant to bring about substantial progress in und...
Author profiling deals with identification of various details about the author of the text (e.g., age, cultural background, gender, native language, personality). In this paper, we describe the participation of our teams (yigal18 and miller18, both teams contain the same people, but in another order) in the PAN 2018 shared task on author profiling,...
Authorship Attribution deals with identifying the author of an anonymous text, i.e., to attribute each test text of unknown authorship to one of a set of known authors, whose training texts are given. In this paper, we describe the participation of our teams (miller18 and yigal18, both teams contain the same people, but in another order) in the PAN...
Classification is one of the most fundamental tasks in data mining and machine learning. It is being applied in an increasing number of fields, e.g. filtering, identification, information retrieval, information extraction, and similarity detection. A basic and necessary condition for the success of a classification task is the proper representation...
Verbal Multi-Word Expressions (VMWEs) are very common in many languages. They include, among others, the following types: Verb-Particle Constructions (VPC) (e.g. get around), Light-Verb Constructions (LVC) (e.g. make a decision), and idioms (ID) (e.g. break a leg). In this paper, we present a new dataset for supervised learning of VMWEs written...
Text Classification (TC) is the task of automatically assigning documents to a fixed number of categories. TC is an important component in many text applications such as text indexing, information extraction, information retrieval, text mining, and word sense disambiguation. In this paper, we present an alternative method of feature reduction - a c...
Social posts and their comments are rich and interesting social data. In this study, we aim to classify comments as relevant or irrelevant to the content of their posts. Since the comments in social media are usually short, their bag-of-words (BoW) representations are highly sparse. We investigate four semantic vector representations for the releva...
Many language identification (LID) systems are based on language models using techniques that consider the fluctuation of speech over time. Considering these fluctuations necessitates longer recording intervals to obtain reasonable accuracy. Our research extracts features from short recording intervals to enable successful classification of spoken...
In this research, we focus on automatic supervised stance classification of tweets. Given test datasets of tweets on five different topics, we try to classify the stance of the tweet authors as either in FAVOR of the target, AGAINST it, or NONE. We apply eight variants of seven supervised machine learning methods and three filtering methods using t...
A wide range of studies has been carried out in the field of automatic correction, especially on spelling correction of single words and on improving execution time and storage space. However, to the best of our knowledge, there are no studies in the field of repairing words included in multi-word quotations and retrieving possible sources of these quotation...
This study tries to determine the time-frame in which the author of a given document lived. The documents are rabbinic documents written in Hebrew-Aramaic languages. The documents are undated and do not contain a bibliographic section, which leaves us with an interesting challenge. To do this, we define a set of key-phrases and formulate variou...
Language models (LMs) are important components of many applications that work with natural language, such as word prediction and completion programs, automatic speech recognition, and machine translation. In this paper, we introduce various types of improvements for LMs dealing with word prediction and completion in Hebrew. Whereas previous systems...
In this research, given a corpus containing blog posts written in Hebrew and two seed sentiment lists, we analyze the positive and negative sentences included in the corpus, and special groups of words that are associated with the positive and negative seed words. We discovered many new negative words (around half of the top 50 words) but only one...
Identification of Multi-Word Expressions (MWEs) lies at the heart of many natural language processing applications. In this research, we deal with a particular type of Hebrew MWEs, Verb-Noun MWEs (VN-MWEs), which combine a verb and a noun with or without other words. Most prior work on MWEs classification focused on linguistic and statistical infor...
False story detection is an important and challenging problem. This paper presents a simple and sound methodology that is able to automatically distinguish between true and false Hebrew stories using either psychological or semantic information. The examined corpus contains 96 stories that were composed by 48 native Hebrew speakers who were asked t...
This research is concerned with the detection of similar academic papers. Given a tested paper from a given corpus of 10,099 peer-reviewed scientific papers, a two-stage process was activated. During the first stage, most of the papers were filtered out using a fast filter method. In the second stage, in order to detect similar papers we applied 23...
A verb-noun Multi-Word Expression (MWE) is a combination of a verb and a noun with or without other words, in which the combination has a meaning different from the meaning of the words considered separately. In this paper, we present a new lexical resource of Hebrew Verb-Noun MWEs (VN-MWEs). The VN-MWEs of this resource were manually collected and...
This paper analyzes what linguistic features differentiate true and false stories written in Hebrew. To do so, we have defined four feature sets containing 145 features: POS-tags, quantitative, repetition, and special expressions. The examined corpus contains stories that were composed by 48 native Hebrew speakers who were asked to tell both false...
Many language identification (LID) systems are based on language models using machine learning (ML) techniques that take into account the fluctuation of speech over time, such as Hidden Markov Models (HMMs). Because they consider this fluctuation, such LID systems need relatively long recording intervals to obtain reasonable accuracy. This re...
In this paper, we present a comparative study of news documents classification using various supervised machine learning methods and different combinations of key-phrases (word N-grams extracted from text) and visual features (extracted from a representative image from each document). The application domain is news documents written in English that...
In this study, we try to determine the time-frame in which the author of a given document lived. The discussed documents are rabbinic documents written in the Hebrew, Aramaic and Yiddish languages. The documents are usually undated and do not contain a bibliographic section, which leaves us with an interesting challenge to determine the desired tim...
This research investigates the problem of news articles classification. The classification is performed using N-gram textual features extracted from text and visual features generated from one representative image. The application domain is news articles written in English that belong to four categories: Business-Finance, Lifestyle-Leisure, Science...
In this research, we identify the era in which the author of the given document(s) lived. For rabbinic documents written in Hebrew-Aramaic, which are usually undated and do not contain any bibliographic section, this problem is important. The aim of this research is to find in which years an author was born and died, based on his documents and...
In this paper, we describe various language models (LMs) and combinations created to support word prediction and completion in Hebrew. We define and apply 5 general types of LMs: (1) Basic LMs (unigrams, bigrams, trigrams, and quadgrams), (2) Backoff LMs, (3) LMs Integrated with tagged LMs, (4) Interpolated LMs, and (5) Interpolated LMs Integrated...
Many current documents include multimedia consisting of text, images and embedded videos. This paper presents a general method that uses Random Forests to automatically extract keyphrases that can be used as very short summaries and to help in retrieval, classification and clustering processes.
Authorship attribution of text documents is a “hot” domain in research; however, almost all of its applications use supervised machine learning (ML) methods. In this research, we explore authorship attribution as a clustering problem, that is, we attempt to complete the task of authorship attribution using unsupervised machine learning methods. The...
In the present paper, we illustrate, using animal names (zoonyms), the specification and design of the phono-semantic matching (PSM) module, which, within the architecture of GALLURA, should be upstream in the control flow. The PSM module takes a word (e.g., a zoonym or a place-name) and an indication of a target language (in practice, Hebrew). The...
Stemming is useful for various natural language processing tasks, such as document indexing and text classification. Therefore, identification of the correct root of any given word is important. For Hebrew this is not a trivial task, due to the complex nature of Hebrew morphology and its orthography. Many Hebrew words are ambiguous in the sense tha...
One category of software for the cultural heritage is tools intended to enhance the fruition of texts of a literary canon. This overlaps much of humanities computing. This paper is about a project, now in the design phase, of a software tool that would help readers to understand homiletical derivations as proposed in the texts of the rabbinic so-calle...
This paper analyzes what stylistic characteristics differentiate different styles of writing, and specifically types of different A-level computer science articles. To do so, we compared various full papers using stylistic feature sets and a supervised machine learning method. We report on the success of this approach in identifying papers from the...
This research investigates whether it is appropriate to use word lists as features for clustering documents to their authors, to the documents' countries of origin or to the historical periods in which they were written. We have defined three kinds of word lists: most frequent words (FW) including function words (stopwords), most frequent filtered...
Disambiguation of ambiguous initialisms and acronyms is critical to the proper understanding of various types of texts. A model that attempts to solve this has previously been presented. This model contained various baseline features, including contextual relationship features, statistical features, and language-specific features. The domain of Jew...
In this project, we investigate the generation of wordplay that can serve as playful “explanations” for given names. We present a working system (part of work in progress), which segments and/or manipulates input names. The system does so by decomposing them into sequences (or phrases) composed of at least two words and/or transforming them into ot...
This research aims to improve keystroke savings for completion and prediction of Hebrew words. This task is very important to augmentative and alternative communication systems as well as to search engines, short message services, and mobile phones. The proposed model is composed of Hebrew corpora containing 177M words, a morphological analyzer, v...
In this research, we investigate the issue of efficient detection of similar academic papers. Given a specific paper, and a corpus of academic papers, most of the papers from the corpus are filtered out using a fast filter method. Then, 47 methods (baseline methods and combinations of them) are applied to detect similar papers, where 34 of the meth...
Information retrieval (IR) and, all the more so, knowledge discovery (KD), do not exist in isolation: it is necessary to consider the architectural context in which they are invoked in order to fulfil given kinds of tasks. This paper discusses a retrieval-intensive context of use, whose intended output is the generation of narrative explanations in...
Citations in documents contain important information about the sources that authors cite and their importance and impact. Therefore, automatic identification of citations from documents is an important task. Citations included in rabbinic literature are more difficult to identify and to extract than citations in scientific papers written in English...
This research investigates classification of documents according to the ethnic group of their authors and/or to the historical period when the documents were written. The classification is done using various combinations of six sets of stylistic features: quantitative, orthographic, topographic, lexical, function, and vocabulary richness. The appli...
In many languages abbreviations are very common and are widely used in both written and spoken language. However, they are not always explicitly defined and in many cases they are ambiguous. This research presents a process that attempts to solve the problem of abbreviation ambiguity using modern machine learning (ML) techniques. Various baseline f...
Plagiarism is the use of the language and thoughts of another work and the representation of them as one's own original work. Various levels of plagiarism exist in many domains in general and in academic papers in particular. Therefore, diverse efforts are taken to automatically identify plagiarism. In this research, we developed software capable o...
Precious historical treasures might be hidden between the lines of a text. There are many implicit details which can be extracted from a text, particularly if one has access to an entire corpus of texts pertaining to the given subject. One of these details is the identification of the era in which the author of the given document(s) lived. For rabb...
Document classification presents challenges due to the large number of features, their dependencies, and the large number of training documents. In this research, we investigated the use of six stylistic feature sets (including 42 features) and/or six name-based feature sets (including 234 features) for various combinations of the following classif...
X-ray diffractometry, within materials engineering, is a promising area of application for case-based reasoning. A large database of spectral diffraction patterns includes entries with different quality marks; moreover, several diffraction patterns happen to be equivalent, identifying the same material (crystalline phase), even though it also happe...
Abbreviations are very common and are widely used in both written and spoken language. However, they are not always explicitly defined and in many cases they are ambiguous. In this research, we present a process that attempts to solve the problem of abbreviation ambiguity. Various features have been explored, including context-related methods and s...
Abbreviations are very common in the Hebrew language in general and in rabbinic writings in particular. A considerable portion of these abbreviations can be interpreted in several ways. In this paper, we present a system for disambiguating ambiguous abbreviations that was developed at Machon Lev (the Jerusalem College of Technology) by the second and third authors under the supervision of the first author. The disambiguation focused on rabbinic writings written in Hebrew-Aramaic. Eighteen methods were developed...
Text classification presents challenges due to the large number of features, their dependencies, and the large number of training documents. In this research, we investigate whether the use of words as features is appropriate for classification of documents to the ethnic group of their authors and/or to the historical period when they were written....
A process that attempts to solve abbreviation ambiguity is presented. Various context- related features and statistical features have been explored. Almost all features are domain independent and language independent. The application domain is Jewish Law documents written in Hebrew. Such documents are known to be rich in ambiguous abbreviations. Va...
This paper describes the construction of the prototype of the first system that creates Torah sermons. The application domain was honoring and respecting parents and Jewish sages. The system creates sermons automatically without demanding prior knowledge from the user. The sermons are created using different characteristics of Torah sermons and dif...
Keyphrases extracted from documents may save precious time for tasks such as filtering, summarization, and categorization. A few such systems are available for documents written in English. In this paper, we propose a model called LEH_KEY (Learning to Extract Hebrew KEYphrases) that for the first time learns to extract keyphrases for documents writ...
Semitic language processing in general is of great interest today. However, the Hebrew and Aramaic languages have been relatively little studied. In this study, we investigate how to classify Jewish Law articles written in these languages according to the ethnic group of their authors. The motivation is to investigate the cultural differences in wr...
Text Summarization is a research domain that attracts many research groups around the scientific world. It is the process of automatically creating a condensed version of a given text that provides useful information for the user. Semitic language processing in general is of great interest today. However, the Hebrew language has been relatively lit...
Text classification is an important and challenging research domain. In this paper, identifying historical period and ethnic origin of documents using stylistic feature sets is investigated. The application domain is Jewish Law articles written in Hebrew-Aramaic. Such documents present various interesting problems for stylistic classification. Firs...
Computer composition of high-quality chess mate problems is a relatively uninvestigated research domain. A previous model, called an Improver of Chess Problems, which is based on a hill-climbing search, slightly improved the quality of 10 out of 36 known problems (about 28%). In this article, we describe an improved model, called Deep Improver of...
Computerized composers of chess mate problems are very rare. Moreover, they produce neither impressive nor creative new chess mate problems. In this paper, we describe a model called Chess Composer. This model uses a 64-bit representation, an ordered version of Iterative Deepening Depth First Search, and a quality function built with the hel...
In this paper we describe game-independent strategies, capable of learning explanation patterns (XPs) for evaluation of any basic game pattern. A basic game pattern is defined as a minimal configuration of a small number of pieces and squares which describes only one salient game feature. Each basic pattern can be evaluated by a suitable XP. We hav...
Many academic journals and conferences require that each article include a list of keyphrases. These keyphrases should provide general information about the contents and the topics of the article. Keyphrases may save precious time for tasks such as filtering, summarization, and categorization. In this paper, we investigate automatic extraction and...
In the Hebrew language, many words each have a few possible stems. However, for a given word in the context of a specific sentence in a specific paragraph in a specific document, each word has only one correct stem. We have developed seven baseline methods in order to find the correct stem for a given word. These methods use contexts, declen...
Most documents do not include keyphrases. There are a few keyphrase extraction systems for documents written in English. However, there is no such system for the Hebrew language. In this ongoing work, we investigate baseline methods that extract keyphrases from Hebrew news HTML documents. These methods have been tested on a set of documents. Each...
Presented here is a natural and intuitively clear O(n^2) heuristic that offers a solution to the maximum matching problem in bipartite graphs. The heuristic is based on a greedy technique and uses global information about neighborhoods for choosing pairs of vertices. Assuming that the bipartite graph is represented by a 0-1 matrix in which rows r...
Computer checkers programs achieve outstanding results at playing checkers. However, no existing program can either compose or improve adequate checkers compositions. In this paper, we present a model that is capable of improving the quality of a part of the existing checkers compositions. In this model we attempt to improve a given composition by...
In many languages, abbreviations are widely used either in writing or talking. However, abbreviations are likely to be ambiguous. Therefore, there is a need for disambiguation. That is, abbreviations should be expanded correctly. Disambiguation of abbreviations is critical for correct understanding not only for the abbreviations themselves but also...
The rapid increase in online information is hard to handle. Summaries such as abstracts help us to reduce this problem. Keywords, which can be regarded as very short summaries, may help even more. Filtering documents by using keywords may save precious time while searching. However, most of the documents do not include keywords. In this paper we...
The Judge's Apprentice is a case-based decision support system implemented and intended for use in Israeli criminal law to aid sentencing in cases of either robbery or rape. The system uses a sentencing tree, which is a hierarchical classification of 371 legal concepts relevant to criminal sentencing. Each leaf in this tree represents an index, whi...
Questions (25)
Hello everyone,
Do you think that H-index is a reliable measure that well reflects the citations' level of the researcher's articles?
Dear Colleagues,
I would appreciate it if you could send links to datasets or lists of words in the Hindi and/or Marathi languages related to any of the following word sets:
- curse words
- abusive words
- sexual words
- hated words
- profane words
- treatment of women
- first/second/third person pronouns
Thanks in advance,
Yaakov
Dear Colleagues,
I would appreciate it if you could send links to datasets or lists of words related to any of the following word sets:
- curse words
- abusive words
- sexual words
- hated words
- profane words
- treatment of women
- first/second/third person pronouns
in the English and/or Hindi and/or Marathi languages.
Thanks in advance,
Yaakov