Package 'morestopwords'

Title: All Stop Words in One Place
Description: A standalone package combining several stop-word lists for 65 languages, with a median of 329 stop words per language and over 1,000 entries for English, Breton, Latin, Slovenian, and Ancient Greek! The user automatically gets access to all the unique stop words contained in: the 'StopwordISO' repository; Python's 'Natural Language Toolkit'; the 'Snowball' stop-word list; the R package 'quanteda'; the 'marimo' repository; the 'Perseus' project; and A. Berra's list of stop words for Ancient Greek and Latin.
Authors: Fabio Ashtar Telarico [aut, cre], Kohei Watanabe [aut]
Maintainer: Fabio Ashtar Telarico <[email protected]>
License: MIT + file LICENSE
Version: 0.2.0
Built: 2025-03-08 02:59:31 UTC
Source: https://github.com/fatelarico/morestopwords

Help Index


Returns ISO codes and names for all languages or only those available in this package

Description

See the relevant Wikipedia article for details on the language codes.

Usage

languages(available = TRUE)

Arguments

available

logical, whether to return only the languages supported in this package.

Details

Note that:

  • The ISO 639-1 code for mainland Chinese was changed to zh-cn.

  • A list of stop words for the variety of Chinese spoken on the island of Taiwan is accessible using the code zh-tw or the name 'Chinese Taiwan'.

  • Ancient Greek has been assigned an artificial ISO 639-1 code (gr) because it had none. Its ISO 639-2 and 639-3 codes are both grc.

Value

A data frame with one row per language (only the supported ones if available is TRUE) and columns for the ISO codes (639-2, 639-3, 639-1) and the language name.

Examples

# Return only the languages supported by this package (the default)
languages()

# Return all languages in the ISO 639-2/3 standard
languages(available = FALSE)

Removes stop words from a string whose language is known

Description

Removes stop words from a string whose language is known

Usage

remove.stopwords(str, lang = "auto", fallback = "English")

Arguments

str

A string or a vector of strings from which to delete the stop words

lang

Either:

  • 'auto' in which case cld2 is used to perform language detection; or

  • A string (or a vector of strings, depending on str) giving an ISO 639-2/3 code or a language name from which to derive an ISO 639-2 code (for language names, string matching is performed).

fallback

Fallback language used when cld2 fails to detect the language or when the manually specified string does not match a supported language. Defaults to 'English'.

Value

A string (or a vector, depending on str) corresponding to the string/s str without stop words for the language/s lang.

Examples

# Multiple strings in different languages
remove.stopwords(str = c(Gibberish = 'dadas',
                         Catalan = 'Adeu amic meu',
                         Irish = 'Slan a chara',
                         French = 'Je suis en Allemagne',
                         German = 'Ich liebe Deutschland'),
                 # Various ways of indicating the language
                 lang = c(NA, 'cata', 'Iris', 'fr', 'deu'),
                 # Yet another way
                 fallback = 'english'
                 )
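
As a rough illustration of what removing stop words involves, the sketch below uses only base R and a tiny made-up stop-word list. It is not the package's implementation: the names drop_stopwords and toy_stops are hypothetical, and the real function additionally resolves the language and picks the matching list.

```r
# Hypothetical toy stop-word list (NOT the package's data)
toy_stops <- c("je", "suis", "en", "the", "a", "of")

# Hypothetical helper: split each string on whitespace, drop every token
# found in `stops` (case-insensitively), and glue the rest back together
drop_stopwords <- function(str, stops) {
  vapply(strsplit(str, "\\s+"), function(tokens) {
    kept <- tokens[!(tolower(tokens) %in% stops)]
    paste(kept, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

res <- drop_stopwords(c("Je suis en Allemagne", "The end of a story"),
                      toy_stops)
print(res)
```

With this toy list, the first string is reduced to 'Allemagne' and the second to 'end story'.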

Collection of stopwords in multiple languages

Description

This function returns stop words contained in the StopwordsISO repository.

Usage

stopwords(lang = "en")

Arguments

lang

Language for which to retrieve the stop words among those supported by StopwordISO. This parameter supports:

  • three-letter ISO 639-2/3 codes (e.g., 'eng');

  • two-letter ISO 639-1 codes ('en');

  • names based on ISO 639-2 codes ('English' or 'english') and their unambiguous substrings ('engl', 'engli', etc.).

Value

A character vector containing the stop words from the selected language as listed in the StopwordISO repository.

Examples

# They all return the correct list of stop words!

stopwords('German')
stopwords('germ')
stopwords('de')
stopwords('deu')
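
The idea of accepting "unambiguous substrings" of a language name can be sketched with base R's pmatch(), which returns NA when a prefix matches more than one candidate. How the package actually performs the matching is not stated here, so this is only an assumption; lang_names and resolve_lang are hypothetical names.

```r
# Hypothetical subset of language names (NOT the package's table)
lang_names <- c("english", "german", "greek", "french")

# Hypothetical helper: resolve a full or partial name, case-insensitively;
# returns NA when nothing matches or when the prefix is ambiguous
resolve_lang <- function(x, choices = lang_names) {
  idx <- pmatch(tolower(x), choices, duplicates.ok = TRUE)
  choices[idx]
}

resolved <- resolve_lang(c("Germ", "engl", "g", "xyz", "French"))
print(resolved)
```

Here 'Germ' and 'engl' resolve to a single name, whereas 'g' is rejected because it could mean either 'german' or 'greek'.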

Combined stop words for all languages

Description

A list of stop words in each of the supported languages

Usage

stopwordsISO

Format

An object of class list of length 65.

Details

Note: All Unicode characters are escaped. To un-escape them, consider using:

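   library(morestopwords)
   # Install 'stringi' if it is not already available
   if(!requireNamespace('stringi', quietly = TRUE)){
     install.packages('stringi')
   }
   # Load the escaped lists and un-escape every language's vector
   data('stopwordsISO')
   stopwords_unescaped <- lapply(stopwordsISO,
                                 stringi::stri_unescape_unicode)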
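
To see what un-escaping does to a single entry, the short sketch below applies stringi::stri_unescape_unicode to two made-up escaped strings. The sample values are illustrative assumptions and are not taken from the package's data.

```r
# Made-up escaped entries, written the way escaped data would be stored
escaped <- c("\\u00e9t\\u00e9", "\\u00fcber")

# Only run the conversion when 'stringi' is installed
if (requireNamespace("stringi", quietly = TRUE)) {
  unescaped <- stringi::stri_unescape_unicode(escaped)
  print(unescaped)
}
```

Each \uXXXX sequence is replaced by the character it encodes, so the two entries become 'été' and 'über'.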

Author(s)

The authors of each stop-word list

Source

All unique stopwords in the following databases: