Accurately find or remove emojis from a blob of text using data from the Unicode Consortium's emoji code repository.
Version 1.x of demoji now bundles Unicode data in the package at install time rather than requiring a download of the codes from unicode.org at runtime. Please see the CHANGELOG.md for details and be familiar with the changes before updating from 0.x to 1.x.
To report any regressions, please open a GitHub issue.
demoji exports several text-related functions for find-and-replace functionality with emojis:
>>> tweet = """\
... #startspreadingthenews yankees win great start by 🎅🏾 going 5strong innings with 5k’s🔥 🐂
... solo homerun 🌋🌋 with 2 solo homeruns and👹 3run homerun… 🤡 🚣🏼 👨🏽⚖️ with rbi’s … 🔥🔥
... 🇲🇽 and 🇳🇮 to close the game🔥🔥!!!….
... WHAT A GAME!!..
... """
>>> demoji.findall(tweet)
{
"🔥": "fire",
"🌋": "volcano",
"👨🏽\u200d⚖️": "man judge: medium skin tone",
"🎅🏾": "Santa Claus: medium-dark skin tone",
"🇲🇽": "flag: Mexico",
"👹": "ogre",
"🤡": "clown face",
"🇳🇮": "flag: Nicaragua",
"🚣🏼": "person rowing boat: medium-light skin tone",
"🐂": "ox",
}
See below for function API.
You can use demoji or python -m demoji to replace emojis in file(s) or stdin with their :code: equivalents:
$ cat out.txt
All done! ✨ 🍰 ✨
$ demoji out.txt
All done! :sparkles: :shortcake: :sparkles:
$ echo 'All done! ✨ 🍰 ✨' | demoji
All done! :sparkles: :shortcake: :sparkles:
$ demoji -
we didnt start the 🔥
we didnt start the :fire:
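The three input modes shown above (file names, piped stdin, and - for stdin) map naturally onto the standard library's fileinput module. The following is a rough sketch of that pattern only, not demoji's actual command-line code. The names CODES, emojis_to_codes, and convert are hypothetical, and the two-entry table stands in for the full emoji data, where a real tool would presumably call replace_with_desc().

```python
import fileinput

# Hypothetical two-entry table; a real tool would use the full emoji data.
CODES = {"\u2728": "sparkles", "\U0001f370": "shortcake"}


def emojis_to_codes(line, sep=":"):
    """Stand-in transform: swap each known emoji for its :code:."""
    for emoji, desc in CODES.items():
        line = line.replace(emoji, sep + desc + sep)
    return line


def convert(paths):
    """Yield converted lines from the given files.

    fileinput treats "-" as stdin and falls back to stdin when
    `paths` is empty, which covers the piped and interactive cases.
    """
    stream = fileinput.input(
        files=paths,
        openhook=fileinput.hook_encoded("utf-8"),  # applies to named files
    )
    with stream:
        for line in stream:
            yield emojis_to_codes(line)
```

Wiring this to a command would then be a matter of looping over convert(sys.argv[1:]) and writing each line to stdout.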
findall(string: str) -> Dict[str, str]
Find emojis within string. Return a mapping of {emoji: description}.
findall_list(string: str, desc: bool = True) -> List[str]
Find emojis within string. Return a list (with possible duplicates). If desc is True, the list contains description codes. If desc is False, the list contains emojis.
replace(string: str, repl: str = "") -> str
Replace emojis in string with repl.
replace_with_desc(string: str, sep: str = ":") -> str
Replace emojis in string with their description codes. The codes are surrounded by sep.
last_downloaded_timestamp() -> datetime.datetime
Show the timestamp of last download for the emoji data bundled with the package.
Numerous emojis that look like single Unicode characters are actually multi-character sequences. For example, the flag of Scotland is seven code points, b'\\U0001f3f4\\U000e0067\\U000e0062\\U000e0073\\U000e0063\\U000e0074\\U000e007f' in full escaped notation. (You can see any of these through s.encode("unicode-escape").)
demoji is careful to handle this and should find the full sequences rather than their incomplete subcomponents.
It does this by sorting emoji codes by length and then compiling a concatenated regular expression that greedily searches for longer emojis first, falling back to shorter ones if not found. This is by no means a highly optimized way of searching, as it has O(N^2) properties, but the focus is on accuracy and completeness.
>>> from pprint import pprint
>>> seq = """\
... I bet you didn't know that 🙋, 🙋♂️, and 🙋♀️ are three different emojis.
... """
>>> pprint(seq.encode('unicode-escape')) # Python 3
(b"I bet you didn't know that \\U0001f64b, \\U0001f64b\\u200d\\u2642\\ufe0f,"
b' and \\U0001f64b\\u200d\\u2640\\ufe0f are three different emojis.\\n')
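The longest-first matching described above can be sketched with the standard library alone. This is a minimal illustration, not demoji's actual implementation: the EMOJI_DESCRIPTIONS table is a tiny hypothetical stand-in for the bundled Unicode data, and build_pattern and find_emojis are names invented for this example.

```python
import re

# Hypothetical three-entry table standing in for the bundled Unicode data.
EMOJI_DESCRIPTIONS = {
    "\U0001f64b": "person raising hand",
    "\U0001f64b\u200d\u2642\ufe0f": "man raising hand",
    "\U0001f64b\u200d\u2640\ufe0f": "woman raising hand",
}


def build_pattern(codes):
    """Compile one alternation that tries longer sequences first."""
    # Sorting by length (descending) puts full ZWJ sequences ahead of
    # their single-code-point prefixes in the alternation.
    ordered = sorted(codes, key=len, reverse=True)
    # re.escape() guards against codes that are regex metacharacters,
    # such as the "*" used in a keycap sequence.
    return re.compile("|".join(re.escape(code) for code in ordered))


def find_emojis(text, table=EMOJI_DESCRIPTIONS):
    """Return {emoji: description} for every table entry found in text."""
    pattern = build_pattern(table)
    return {match: table[match] for match in pattern.findall(text)}


seq = "\U0001f64b, \U0001f64b\u200d\u2642\ufe0f, and \U0001f64b\u200d\u2640\ufe0f"
found = find_emojis(seq)
print(len(found))  # 3
```

If the alternation were ordered shortest-first instead, the single code point U+1F64B would match at all three positions and the two longer sequences would never be reported, which is the failure mode the length sort is meant to avoid.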
Add a __main__.py to allow running python -m demoji; add an entry-point demoji command; permit stdin (-), file name(s), or piped stdin. Contribution by @jap.
This is a backwards-incompatible release with several substantial changes.
The largest change is that demoji now bundles a static copy of Unicode emoji data with the package at install time, rather than requiring a runtime download of the codes from unicode.org.
Changes below are grouped by their corresponding Semantic Versioning identifier.
SemVer MAJOR:
The demoji package now bundles emoji data that is distributed with the package at install time, rather than requiring a download of the codes from the unicode.org site at runtime (closes #23).
The following functions are removed from the demoji API: download_codes(), parse_unicode_sequence(), parse_unicode_range(), stream_unicodeorg_emojifile().
SemVer MINOR:
The demoji.DIRECTORY and demoji.CACHEPATH attributes are deprecated because they are no longer functionally in use by the package. Accessing them will warn with a FutureWarning, and these attributes may be removed completely in a future release.
demoji can now be installed with optional ujson support for faster loading of emoji data from file (versus the standard library's json, which is the default); use python -m pip install demoji[ujson].
The requests and colorama dependencies have been removed completely.
importlib_resources (a backport module) is now required for Python < 3.7.
The EMOJI_VERSION attribute, newly added to demoji, is a str denoting the Unicode database version in use.
SemVer PATCH:
Fix demoji.__all__ to properly include demoji.findall_list().
Functions that call set_emoji_pattern() are now decorated with a @cache_setter to set the cache.
demoji.last_downloaded_timestamp() returns correct UTC time.
(See 6c8ad15.)
Add findall_list() and replace_with_desc() functions. (See 7cea333.)
setup.cfg. (See 8f141e7.)
Fix an issue in setup.py that would require dependencies to be installed prior to installation of demoji in order to find the __version__. (See d5f429c.)
Use io.open(..., encoding='utf-8') consistently in setup.py. (See 1efec5d.)
Use re.escape() rather than failing to compile a small subset of codes.
__init__.py.