Yaakov HaCohen-Kerner

Verified
Yaakov verified their affiliation via an institutional email.
  • Professor, Ph.D.
  • Professor (Associate) at Jerusalem College of Technology

About

109
Publications
85,857
Reads
1,218
Citations
Introduction
Prof. Yaakov HaCohen-Kerner teaches at the CS department of the Jerusalem College of Technology (JCT). Yaakov is a co-author/author of 108 papers. Yaakov's main research domains are text classification, sentiment analysis, author profiling, offensive language detection in social texts, and detection of mental disorders.
Current institution
Jerusalem College of Technology
Current position
  • Professor (Associate)
Additional affiliations
October 2020 - present
Jerusalem College of Technology
July 2015 - present
Jerusalem College of Technology
Position
  • Head of Department

Publications

Publications (109)
Article
Full-text available
In this study, we aim to detect girls who are suspected of being anorexic from social media texts written in Hebrew. We constructed a dataset containing 100 blog posts written by females who are probably anorexic, and 100 blog posts written by females who are likely to be non-anorexic. The construction of this dataset was supervised and approved by a...
Article
Full-text available
In this research, we extract time-related expressions from a rabbinic text in a semi-automatic manner. These expressions usually appear next to rabbinic references (name / nickname / acronym / book-name). The first step toward our goal is to find all the expressions near references in the corpus. However, not all of the phrases around the reference...
Article
Author profiling from text documents has become a popular task in natural language applications in recent years. Author profiling is important for various domains such as advertising, marketing, forensics, and security. This survey focuses on profiling age and gender, the two features that are probably the most researched profile attributes. In...
Conference Paper
Full-text available
In this paper, we describe our submissions for the HASOC 2021 contest. We tackled subtask 1A that addresses the problem of hate speech and offensive language identification in three languages: English, Hindi, and Marathi. We developed different models using six classical supervised machine learning methods: support vector classifier, binary support...
Conference Paper
Full-text available
In this paper, we describe our submissions for the UrduFake 2021 track. We tackled the task entitled "Fake News Detection in the Urdu Language". We developed different models using three classical supervised machine learning methods: Support Vector Classifier, Random Forest, and Logistic Regression. Our machine learning models were applied to vario...
Conference Paper
Full-text available
In this paper, we describe our submissions for the PAN at CLEF 2021 contest. We tackled the subtask "Profiling Hate Speech Spreaders on Twitter". We developed different models for the English and Spanish languages, using classic machine learning methods like Support Vector Classifier, Multi-Layer Perceptron, Logistic Regression, Random Forest, Ada-Boost...
Article
Full-text available
Text classification (TC) is the task of automatically assigning documents to a fixed number of categories. TC is an important component in many text applications. Many of these applications perform preprocessing. There are different types of text preprocessing, e.g., conversion of uppercase letters into lowercase letters, HTML tag removal, stopword...
Conference Paper
Full-text available
We introduce in this paper a generic approach to combine implicit crowdsourcing and language learning in order to mass-produce language resources (LRs) for any language for which a crowd of language learners can be involved. We present the approach by explaining its core paradigm that consists in pairing specific types of LRs with specific exercise...
Article
In the technological age, the phenomenon of complaint letters published on the Internet is increasing. Therefore, it is important to automatically classify complaint letters according to various criteria, such as company categories. In this research, we investigated the automatic text classification of complaint letters written in Hebrew that were...
Conference Paper
Full-text available
Aim/Purpose: Finding and tagging citations in ancient Hebrew religious documents. These documents have no structured citations and no bibliography. Background: We look for common patterns within Hebrew religious texts. Methodology: We developed a method that goes over the texts and extracts sentences containing the names of three famous auth...
Conference Paper
Full-text available
Author profiling deals with the identification of various details about the author of the text (e.g., age and gender). In this paper, we describe the participation of our team (hacohenkerner19) in the PAN 2019 shared task on Bots and Gender Profiling in two languages: English and Spanish. Given a Twitter feed, we should determine whether its author...
Conference Paper
Full-text available
The increase in online social media provides an excellent platform for companies, on the one hand, to read and learn from customers and improve their marketing policies and on the other hand, to perform various analysis and classification tasks. In this short paper, we present basic statistics of the domains of text classification, sentiment analys...
Conference Paper
Full-text available
Hundreds of millions of people worldwide suffer from various mental disorders. Recent studies have shown that using text classification models, some of the disorders can be identified by intelligent analysis of the texts written by these people. Another related domain is the automatic classification of documents according to their authors’ mental...
Conference Paper
Full-text available
Text classification (TC) is an important component in many research domains, such as information extraction, information retrieval, and text mining. Therefore, the question of whether and how TC can be generally improved is crucial. In this research, we implemented a method that partially resembles the multiplication method. An example of the mult...
Article
Full-text available
Hundreds of millions of people worldwide suffer from various mental disorders. Recent studies have shown that using text classification models, some of the disorders can be identified by intelligent analysis of the texts written by these people. Another related domain is the automatic classification of documents according to their mental state usi...
Article
Full-text available
This article presents a unique method in text and data mining for finding the era, i.e., mining temporal data, in which an anonymous author was living. Finding this era can assist in the examination of a fake document or extracting the time period in which a writer lived. The study and the experiments concern Hebrew, and in some parts, Aramaic and...
Book
Full-text available
Multiword expressions (MWEs) are known as a “pain in the neck” due to their idiosyncratic behaviour. While some categories of MWEs have been largely studied, verbal MWEs (VMWEs) such as to take a walk, to break one’s heart or to turn off have been relatively rarely modelled. We describe an initiative meant to bring about substantial progress in und...
Conference Paper
Full-text available
Author profiling deals with identification of various details about the author of the text (e.g., age, cultural background, gender, native language, personality). In this paper, we describe the participation of our teams (yigal18 and miller18, both teams contain the same people, but in another order) in the PAN 2018 shared task on author profiling,...
Conference Paper
Full-text available
Authorship Attribution deals with identifying the author of an anonymous text, i.e., to attribute each test text of unknown authorship to one of a set of known authors, whose training texts are given. In this paper, we describe the participation of our teams (miller18 and yigal18, both teams contain the same people, but in another order) in the PAN...
Chapter
Classification is one of the most fundamental tasks in data mining and machine learning. It is being applied in an increasing number of fields, e.g. filtering, identification, information retrieval, information extraction, and similarity detection. A basic and necessary condition for the success of a classification task is the proper representation...
Chapter
Verbal Multi-Word Expressions (VMWEs) are very common in many languages. They include, among others, the following types: Verb-Particle Constructions (VPC) (e.g. get around), Light-Verb Constructions (LVC) (e.g. make a decision), and idioms (ID) (e.g. break a leg). In this paper, we present a new dataset for supervised learning of VMWEs written...
Article
Full-text available
Text Classification (TC) is the task of automatically assigning documents to a fixed number of categories. TC is an important component in many text applications such as text indexing, information extraction, information retrieval, text mining, and word sense disambiguation. In this paper, we present an alternative method of feature reduction - a c...
Chapter
Social posts and their comments are rich and interesting social data. In this study, we aim to classify comments as relevant or irrelevant to the content of their posts. Since the comments in social media are usually short, their bag-of-words (BoW) representations are highly sparse. We investigate four semantic vector representations for the releva...
Article
Many language identification (LID) systems are based on language models using techniques that consider the fluctuation of speech over time. Considering these fluctuations necessitates longer recording intervals to obtain reasonable accuracy. Our research extracts features from short recording intervals to enable successful classification of spoken...
Conference Paper
Full-text available
In this research, we focus on automatic supervised stance classification of tweets. Given test datasets of tweets from five various topics, we try to classify the stance of the tweet authors as either in FAVOR of the target, AGAINST it, or NONE. We apply eight variants of seven supervised machine learning methods and three filtering methods using t...
Article
A wide range of studies have been carried out in the field of automatic correction, especially on spelling correction of single words and on improving execution time and storage space. However, to the best of our knowledge, there are no studies in the field of repairing words included in multi-word quotations and retrieving possible sources of these quotation...
Chapter
This study is trying to determine the time-frame in which the author of a given document lived. The documents are rabbinic documents written in Hebrew-Aramaic languages. The documents are undated and do not contain a bibliographic section, which leaves us with an interesting challenge. To do this, we define a set of key-phrases and formulate variou...
Article
Language models (LMs) are important components of many applications that work with natural language, such as word prediction and completion programs, automatic speech recognition, and machine translation. In this paper, we introduce various types of improvements for LMs dealing with word prediction and completion in Hebrew. Whereas previous systems...
Article
Full-text available
In this research, given a corpus containing blog posts written in Hebrew and two seed sentiment lists, we analyze the positive and negative sentences included in the corpus, and special groups of words that are associated with the positive and negative seed words. We discovered many new negative words (around half of the top 50 words) but only one...
Conference Paper
Full-text available
Identification of Multi-Word Expressions (MWEs) lies at the heart of many natural language processing applications. In this research, we deal with a particular type of Hebrew MWEs, Verb-Noun MWEs (VN-MWEs), which combine a verb and a noun with or without other words. Most prior work on MWEs classification focused on linguistic and statistical infor...
Article
False story detection is an important and challenging problem. This paper presents a simple and sound methodology that is able to automatically distinguish between true and false Hebrew stories using either psychological or semantic information. The examined corpus contains 96 stories that were composed by 48 native Hebrew speakers who were asked t...
Article
This research is concerned with the detection of similar academic papers. Given a tested paper from a given corpus of 10,099 peer-reviewed scientific papers, a two-stage process was activated. During the first stage, most of the papers were filtered out using a fast filter method. In the second stage, in order to detect similar papers we applied 23...
Conference Paper
Full-text available
A verb-noun Multi-Word Expression (MWE) is a combination of a verb and a noun with or without other words, in which the combination has a meaning different from the meaning of the words considered separately. In this paper, we present a new lexical resource of Hebrew Verb-Noun MWEs (VN-MWEs). The VN-MWEs of this resource were manually collected and...
Conference Paper
Full-text available
This paper analyzes what linguistic features differentiate true and false stories written in Hebrew. To do so, we have defined four feature sets containing 145 features: POS-tags, quantitative, repetition, and special expressions. The examined corpus contains stories that were composed by 48 native Hebrew speakers who were asked to tell both false...
Conference Paper
Full-text available
Many language identification (LID) systems are based on language models using machine learning (ML) techniques that take into account the fluctuation of speech over time, such as Hidden Markov Models (HMM). Because they consider the fluctuation of speech, these LID systems use relatively long recording intervals to obtain reasonable accuracy. This re...
Conference Paper
In this paper, we present a comparative study of news documents classification using various supervised machine learning methods and different combinations of key-phrases (word N-grams extracted from text) and visual features (extracted from a representative image from each document). The application domain is news documents written in English that...
Conference Paper
In this study, we try to determine the time-frame in which the author of a given document lived. The discussed documents are rabbinic documents written in the Hebrew, Aramaic and Yiddish languages. The documents are usually undated and do not contain a bibliographic section, which leaves us with an interesting challenge to determine the desired tim...
Conference Paper
Full-text available
This research investigates the problem of news articles classification. The classification is performed using N-gram textual features extracted from text and visual features generated from one representative image. The application domain is news articles written in English that belong to four categories: Business-Finance, Lifestyle-Leisure, Science...
Conference Paper
Full-text available
In this research, we identify the era in which the author of the given document(s) lived. For rabbinic documents written in Hebrew-Aramaic, which are usually undated and do not contain any bibliographic section, this problem is important. The aim of this research is to find in which years an author was born and died, based on his documents and...
Conference Paper
Full-text available
In this paper, we describe various language models (LMs) and combinations created to support word prediction and completion in Hebrew. We define and apply 5 general types of LMs: (1) Basic LMs (unigrams, bigrams, trigrams, and quadgrams), (2) Backoff LMs, (3) LMs Integrated with tagged LMs, (4) Interpolated LMs, and (5) Interpolated LMs Integrated...
Conference Paper
Full-text available
Many current documents include multimedia consisting of text, images and embedded videos. This paper presents a general method that uses Random Forests to automatically extract keyphrases that can be used as very short summaries and to help in retrieval, classification and clustering processes.
Article
Authorship attribution of text documents is a “hot” domain in research; however, almost all of its applications use supervised machine learning (ML) methods. In this research, we explore authorship attribution as a clustering problem, that is, we attempt to complete the task of authorship attribution using unsupervised machine learning methods. The...
Chapter
In the present paper, we illustrate on animal names (zoonyms) the specification and design of the phono-semantic matching (PSM) module which within the architecture of GALLURA, should be upstream in the control flow. The PSM module takes a word (e.g., a zoonym, or then a place-name) and an indication of a target-language (in practice, Hebrew). The...
Chapter
Stemming is useful for various natural language processing tasks, such as document indexing and text classification. Therefore, identification of the correct root of any given word is important. For Hebrew this is not a trivial task, due to the complex nature of Hebrew morphology and its orthography. Many Hebrew words are ambiguous in the sense tha...
Chapter
One category of software for the cultural heritage is tools intended to enhance the fruition texts of a literary canon. This overlaps much of humanities computing. This paper is about a project, now in the design phase, of a software tool that would help readers to understand homiletical derivations as proposed in the texts of the rabbinic so-calle...
Conference Paper
Full-text available
This paper analyzes what stylistic characteristics differentiate different styles of writing, and specifically types of different A-level computer science articles. To do so, we compared various full papers using stylistic feature sets and a supervised machine learning method. We report on the success of this approach in identifying papers from the...
Conference Paper
This research investigates whether it is appropriate to use word lists as features for clustering documents to their authors, to the documents' countries of origin or to the historical periods in which they were written. We have defined three kinds of word lists: most frequent words (FW) including function words (stopwords), most frequent filtered...
Article
Disambiguation of ambiguous initialisms and acronyms is critical to the proper understanding of various types of texts. A model that attempts to solve this has previously been presented. This model contained various baseline features, including contextual relationship features, statistical features, and language-specific features. The domain of Jew...
Conference Paper
In this project, we investigate the generation of wordplay that can serve as playful “explanations” for given names. We present a working system (part of work in progress), which segments and/or manipulates input names. The system does so by decomposing them into sequences (or phrases) composed of at least two words and/or transforming them into ot...
Conference Paper
Full-text available
This research aims to improve keystroke savings for completion and prediction of Hebrew words. This task is very important to augmentative and alternative communication systems as well as to search engines, short messages services, and mobile phones. The proposed model is composed of Hebrew corpora containing 177M words, a morphological analyzer, v...
Conference Paper
In this research, we investigate the issue of efficient detection of similar academic papers. Given a specific paper, and a corpus of academic papers, most of the papers from the corpus are filtered out using a fast filter method. Then, 47 methods (baseline methods and combinations of them) are applied to detect similar papers, where 34 of the meth...
Conference Paper
Information retrieval (IR) and, all the more so, knowledge discovery (KD), do not exist in isolation: it is necessary to consider the architectural context in which they are invoked in order to fulfil given kinds of tasks. This paper discusses a retrieval-intensive context of use, whose intended output is the generation of narrative explanations in...
Article
Full-text available
Citations in documents contain important information about the sources that authors cite and their importance and impact. Therefore, automatic identification of citations from documents is an important task. Citations included in rabbinic literature are more difficult to identify and to extract than citations in scientific papers written in English...
Article
Full-text available
This research investigates classification of documents according to the ethnic group of their authors and/or to the historical period when the documents were written. The classification is done using various combinations of six sets of stylistic features: quantitative, orthographic, topographic, lexical, function, and vocabulary richness. The appli...
Article
In many languages abbreviations are very common and are widely used in both written and spoken language. However, they are not always explicitly defined and in many cases they are ambiguous. This research presents a process that attempts to solve the problem of abbreviation ambiguity using modern machine learning (ML) techniques. Various baseline f...
Conference Paper
Full-text available
Plagiarism is the use of the language and thoughts of another work and the representation of them as one's own original work. Various levels of plagiarism exist in many domains in general and in academic papers in particular. Therefore, diverse efforts are taken to automatically identify plagiarism. In this research, we developed software capable o...
Conference Paper
Full-text available
Precious historical treasures might be hidden between the lines of a text. There are many implicit details which can be extracted from a text, particularly if one has access to an entire corpus of texts pertaining to the given subject. One of these details is the identification of the era in which the author of the given document(s) lived. For rabb...
Article
Document classification presents challenges due to the large number of features, their dependencies, and the large number of training documents. In this research, we investigated the use of six stylistic feature sets (including 42 features) and/or six name-based feature sets (including 234 features) for various combinations of the following classif...
Article
Full-text available
X-ray diffractometry, within materials engineering, is a promising area of application for case-based reasoning. A large database of spectral diffraction patterns includes entries with different quality marks; moreover, several diffraction patterns happen to be equivalent, identifying the same material (crystalline phase), even though it also happe...
Conference Paper
Full-text available
Abbreviations are very common and are widely used in both written and spoken language. However, they are not always explicitly defined and in many cases they are ambiguous. In this research, we present a process that attempts to solve the problem of abbreviation ambiguity. Various features have been explored, including context-related methods and s...
Article
Full-text available
Abbreviations are very common in the Hebrew language in general and in Torah writings in particular. A considerable portion of these abbreviations can be interpreted in several ways. In this article, we present a system for disambiguating ambiguous abbreviations that was developed at Machon Lev (Jerusalem College of Technology) by the second and third authors under the supervision of the first author. The disambiguation focused on Torah writings written in Hebrew-Aramaic. Eighteen methods were developed...
Article
Text classification presents challenges due to the large number of features, their dependencies, and the large number of training documents. In this research, we investigate whether the use of words as features is appropriate for classification of documents to the ethnic group of their authors and/or to the historical period when they were written....
Conference Paper
Full-text available
A process that attempts to solve abbreviation ambiguity is presented. Various context- related features and statistical features have been explored. Almost all features are domain independent and language independent. The application domain is Jewish Law documents written in Hebrew. Such documents are known to be rich in ambiguous abbreviations. Va...
Article
This paper describes the construction of the prototype of the first system that creates Torah sermons. The application domain was honoring and respecting parents and Jewish sages. The system creates sermons automatically without demanding prior knowledge from the user. The sermons are created using different characteristics of Torah sermons and dif...
Article
Keyphrases extracted from documents may save precious time for tasks such as filtering, summarization, and categorization. A few such systems are available for documents written in English. In this paper, we propose a model called LEH_KEY (Learning to Extract Hebrew KEYphrases) that for the first time learns to extract keyphrases for documents writ...
Conference Paper
Full-text available
Semitic language processing in general is of great interest today. However, the Hebrew and Aramaic languages have been relatively little studied. In this study, we investigate how to classify Jewish Law articles written in these languages according to the ethnic group of their authors. The motivation is to investigate the cultural differences in wr...
Conference Paper
Full-text available
Text Summarization is a research domain that attracts many research groups around the scientific world. It is the process of automatically creating a condensed version of a given text that provides useful information for the user. Semitic language processing in general is of great interest today. However, the Hebrew language has been relatively lit...
Conference Paper
Full-text available
Text classification is an important and challenging research domain. In this paper, identifying historical period and ethnic origin of documents using stylistic feature sets is investigated. The application domain is Jewish Law articles written in Hebrew-Aramaic. Such documents present various interesting problems for stylistic classification. Firs...
Article
Computer composition of high-quality chess mate problems is a relatively uninvestigated research domain. A previous model, called an Improver of Chess Problems, which is based on a hill-climbing search, slightly improved the quality of 10 out of 36 known problems (about 28%). In this article, we describe an improved model, called Deep Improver of...
Article
Full-text available
Computerized composers of chess mate problems are very rare. Moreover, they produce neither impressive nor creative new chess mate problems. In this paper, we describe a model called Chess Composer. This model uses a 64-bit representation, an ordered version of Iterative Deepening Depth First Search, and a quality function built with the hel...
Chapter
Full-text available
In this paper we describe game-independent strategies, capable of learning explanation patterns (XPs) for evaluation of any basic game pattern. A basic game pattern is defined as a minimal configuration of a small number of pieces and squares which describes only one salient game feature. Each basic pattern can be evaluated by a suitable XP. We hav...
Conference Paper
Full-text available
Many academic journals and conferences require that each article include a list of keyphrases. These keyphrases should provide general information about the contents and the topics of the article. Keyphrases may save precious time for tasks such as filtering, summarization, and categorization. In this paper, we investigate automatic extraction and...
Article
Full-text available
In the Hebrew language, many words each have a few possible stems. However, in the context of a specific sentence in a specific paragraph in a specific document, a given word has only one correct stem. We have developed seven baseline methods in order to find the correct stem for a given word. These methods use contexts, declen...
Article
Full-text available
Most documents do not include keyphrases. There are a few keyphrase extraction systems for documents written in English. However, there is no such system for the Hebrew language. In this ongoing work, we investigate baseline methods that extract keyphrases from Hebrew news HTML documents. These methods have been tested on a set of documents. Each...
Article
Full-text available
Presented here is a natural and intuitively clear O(n^2) heuristic that offers a solution to the maximum matching problem in bipartite graphs. The heuristic is based on a greedy technique and uses global information about neighborhoods for choosing pairs of vertices. Assuming that the bipartite graph is represented by a 0-1 matrix in which rows r...
Article
Full-text available
Computer checkers programs achieve outstanding results at playing checkers. However, no existing program can either compose or improve adequate checkers compositions. In this paper, we present a model that is capable of improving the quality of a part of the existing checkers compositions. In this model we attempt to improve a given composition by...
Conference Paper
Full-text available
In many languages, abbreviations are widely used both in writing and in speech. However, abbreviations are likely to be ambiguous. Therefore, there is a need for disambiguation. That is, abbreviations should be expanded correctly. Disambiguation of abbreviations is critical for correct understanding not only for the abbreviations themselves but also...
Conference Paper
Full-text available
The rapid increase of online information is hard to handle. Summaries such as abstracts help us to reduce this problem. Keywords, which can be regarded as very short summaries, may help even more. Filtering documents by using keywords may save precious time while searching. However, most of the documents do not include keywords. In this paper we...
Article
Full-text available
The Judge's Apprentice is a case-based decision support system implemented and intended for use in Israeli criminal law to aid sentencing in cases of either robbery or rape. The system uses a sentencing tree, which is a hierarchical classification of 371 legal concepts relevant to criminal sentencing. Each leaf in this tree represents an index, whi...

Questions

Questions (25)
Question
Hello everyone,
Do you think that the H-index is a reliable measure that accurately reflects the citation level of a researcher's articles?
Question
Dear Colleagues,
I would appreciate it if you could send links to datasets or lists of words in the Hindi and/or Marathi languages related to any of the following word sets:
  • curse words
  • abusive words
  • sexual words
  • hate words
  • profane words
  • treatment of women
  • first/second/third person pronouns
Thanks in advance,
Yaakov
Question
Dear Colleagues,
I would appreciate it if you could send links to datasets or lists of words related to any of the following word sets:
  • curse words
  • abusive words
  • sexual words
  • hate words
  • profane words
  • treatment of women
  • first/second/third person pronouns
in the English and/or Hindi and/or Marathi languages.
Thanks in advance,
Yaakov
