Title: | All Stop Words in One Place |
---|---|
Description: | A standalone package combining several stop-word lists for 65 languages, with a median of 329 stop words per language and over 1,000 entries for English, Breton, Latin, Slovenian, and Ancient Greek! The user automatically gets access to all the unique stop words contained in: the 'StopwordISO' repository; Python's 'Natural Language Toolkit'; the 'Snowball' stop-word list; the R package 'quanteda'; the 'marimo' repository; the 'Perseus' project; and A. Berra's list of stop words for Ancient Greek and Latin. |
Authors: | Fabio Ashtar Telarico [aut, cre] |
Maintainer: | Fabio Ashtar Telarico <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.0 |
Built: | 2025-03-08 02:59:31 UTC |
Source: | https://github.com/fatelarico/morestopwords |
See the relevant Wikipedia article for details on the language codes.
languages(available = TRUE)
available |
logical, whether to return only the languages supported in this package. |
Note that:
The ISO 639-1 code for mainland Chinese was changed to zh-cn.
A list of stop words in the variety of Chinese spoken on the island of Taiwan is accessible using the ISO 639-1 code zh-tw or the name 'Chinese Taiwan'.
Ancient Greek has been assigned an artificial ISO 639-1 code (gr) because it had none. Its ISO 639-2 and 639-3 codes are both grc.
A data frame with a row for each language (only the supported ones if available is TRUE) and columns for the several ISO codes (639-2, 639-3, 639-1) and the name.
# Return only the languages supported in this package (the default)
languages()
# Return all languages in the ISO 639-2/3 standard
languages(available = FALSE)
Removes stop words from a string whose language is known
remove.stopwords(str, lang = "auto", fallback = "English")
str |
A string or a vector of strings from which to delete the stop words |
lang |
Either:
|
fallback |
Fallback language in case |
A string (or a vector of strings, depending on str) corresponding to str with the stop words for the language(s) in lang removed.
# Multiple strings in different languages
remove.stopwords(str = c(Gibberish = 'dadas',
                         Catalan = 'Adeu amic meu',
                         Irish = 'Slan a chara',
                         French = 'Je suis en Allemagne',
                         German = 'Eich liebe Deutschland'),
                 # Various ways of indicating the language
                 lang = c(NA, 'cata', 'Iris', 'fr', 'deu'),
                 # Yet another way
                 fallback = 'english'
)
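To make the return value concrete, the following is a minimal base-R sketch of what removing stop words from a named vector of strings could look like. It is not the package's implementation: `toy_stopwords` and `strip_stopwords()` are hypothetical names, the stop-word list is invented for illustration, and the whitespace tokenisation and case-insensitive matching are assumptions.

```r
# Hypothetical stop-word list; the package's real lists are far longer.
toy_stopwords <- c("je", "suis", "en")

# Remove stop words from each string (assumed: whitespace tokens, case-insensitive).
strip_stopwords <- function(str, stopwords) {
  out <- vapply(str, function(s) {
    tokens <- strsplit(s, "\\s+")[[1]]
    kept <- tokens[!(tolower(tokens) %in% tolower(stopwords))]
    paste(kept, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
  # Keep the names of the input, if any, so the output lines up with str.
  names(out) <- names(str)
  out
}

strip_stopwords(c(French = "Je suis en Allemagne"), toy_stopwords)
```

The result has the same length and names as the input, which mirrors the description of the value returned by remove.stopwords.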
This function returns stop words contained in the StopwordsISO repository.
stopwords(lang = "en")
lang |
Language for which to retrieve the stop words among those supported by StopwordISO. This parameter supports:
|
A character vector containing the stop words from the selected language as listed in the StopwordISO repository.
# They all return the correct list of stop words!
stopwords('German')
stopwords('germ')
stopwords('de')
stopwords('deu')
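The examples above accept a full name, a truncated name, and two kinds of ISO code. One way such a lookup might work is sketched below in base R. This is an assumption about the general technique, not the package's code: `toy_languages`, its column names, and `resolve_language()` are hypothetical.

```r
# Hypothetical lookup table; column names are invented for this sketch.
toy_languages <- data.frame(
  name = c("German", "French", "Catalan"),
  iso1 = c("de", "fr", "ca"),
  iso3 = c("deu", "fra", "cat"),
  stringsAsFactors = FALSE
)

resolve_language <- function(lang, table = toy_languages) {
  key <- tolower(lang)
  # 1. Try an exact match on either ISO code.
  hit <- which(table$iso1 == key | table$iso3 == key)
  # 2. Otherwise, match the beginning of the language name.
  if (length(hit) == 0) {
    hit <- which(startsWith(tolower(table$name), key))
  }
  # Ambiguous or missing matches yield NA rather than a guess.
  if (length(hit) != 1) return(NA_character_)
  table$name[hit]
}

resolve_language("germ")
```

Checking the codes before the names keeps a short input such as 'de' from being treated as a name prefix.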
A list of stop words in each of the supported languages
stopwordsISO
An object of class list of length 65.
Note: All unicode characters are escaped. To un-escape them, consider using:
library(AllStopwords)
if (!requireNamespace('stringi')) {
  install.packages('stringi')
}
data('stopwordsISO')
stopwords_unescaped <- lapply(stopwordsISO, stringi::stri_unescape_unicode)
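Because the object is a plain list of character vectors, it can be inspected with base R. The sketch below uses a small hypothetical list, `toy_stopwords_list`, in place of stopwordsISO; naming the elements by language code is an assumption about the real object, not something stated here.

```r
# Hypothetical stand-in for stopwordsISO: one character vector per language.
toy_stopwords_list <- list(
  en = c("the", "and", "of", "to"),
  fr = c("le", "la"),
  de = c("der", "die", "das")
)

# Number of stop words per language.
sizes <- lengths(toy_stopwords_list)

# Median list size across languages.
median(sizes)

# Languages with more than two entries.
names(sizes)[sizes > 2]
```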
Each stop-word list's Authors
All unique stop words in the following databases:
the StopwordISO repository;
Python's Natural Language Toolkit (nltk);
the Snowball stop-word list (http://snowball.tartarus.org/algorithms/english/stop.txt);
the R package quanteda;
the marimo repository;
the Perseus project; and
Aurélien Berra's list of stop words for Ancient Greek and Latin (doi:10.5281/zenodo.3860343).
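As an illustration of what "all unique stop words" across several sources means, here is a minimal base-R sketch that merges toy lists and drops duplicates. The vectors `source_a`, `source_b`, and `source_c` are invented; how the package actually combines its sources is not described here.

```r
# Hypothetical stop-word lists from three different sources.
source_a <- c("the", "and", "of")
source_b <- c("and", "to")
source_c <- c("of", "by")

# Successive unions keep each word once, in order of first appearance.
merged <- Reduce(union, list(source_a, source_b, source_c))
merged
```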