This module provides Python bindings to the tree-sitter parsing library.
This package currently only works with Python 3. There are no library dependencies; the package is distributed as a CPython-compiled wheel.
pip3 install abch-tree-sitter
This is a fork of py-tree-sitter by Max Brunsfeld with UTF-16 support added and distributed as CPython-compiled wheels.
First you'll need a Tree-sitter language implementation for each language that you want to parse. You can clone some of the existing language repos or create your own:
git clone https://github.com/tree-sitter/tree-sitter-go
git clone https://github.com/tree-sitter/tree-sitter-javascript
git clone https://github.com/tree-sitter/tree-sitter-python
Use the Language.build_library method to compile these into a library that's usable from Python. This function will return immediately if the library has already been compiled since the last time its source code was modified:
from tree_sitter import Language, Parser
Language.build_library(
    # Store the library in the `build` directory
    'build/my-languages.so',
    # Include one or more languages (paths to the cloned grammar repos,
    # here assumed to live under `vendor/`)
    [
        'vendor/tree-sitter-go',
        'vendor/tree-sitter-javascript',
        'vendor/tree-sitter-python'
    ]
)
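The skip-if-up-to-date behaviour described above can be pictured as a timestamp comparison. The following is only a sketch of that idea in plain Python, not the library's implementation; `needs_rebuild` and its parameters are hypothetical names introduced for illustration.

```python
import os


def needs_rebuild(output_path, source_paths):
    """Return True if the output is missing or older than any source file.

    Hypothetical helper: it mirrors the idea of skipping compilation when
    the library is newer than its sources, nothing more.
    """
    if not os.path.exists(output_path):
        return True
    output_mtime = os.path.getmtime(output_path)
    # Any source modified after the output was written forces a rebuild.
    return any(os.path.getmtime(path) > output_mtime for path in source_paths)
```

A caller could guard an expensive build step with `if needs_rebuild('build/my-languages.so', sources): ...`, where `sources` is a list of grammar source files chosen by the caller.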
Load the languages into your app as Language objects:
GO_LANGUAGE = Language('build/my-languages.so', 'go')
JS_LANGUAGE = Language('build/my-languages.so', 'javascript')
PY_LANGUAGE = Language('build/my-languages.so', 'python')
Create a Parser and configure it to use one of the languages:
parser = Parser()
parser.set_language(PY_LANGUAGE)
Parse some source code:
tree = parser.parse(bytes("""
def foo():
    if bar:
        baz()
""", "utf8"))
If you have your source code in some data structure other than a bytes object, you can pass a "read" callable to the parse function.
The read callable receives a byte offset and a point tuple; it can use either one to read from the buffer, and it returns the source code as a bytes object. Returning an empty bytes object or None terminates parsing for that line. The default encoding is utf8.
For example, to use the byte offset:
src = bytes("""
def foo():
    if bar:
        baz()
""", "utf8")

def read_callable(byte_offset, point):
    return src[byte_offset:byte_offset+1]

tree = parser.parse(read_callable)
And to use the point:
src_lines = ["def foo():\n", "    if bar:\n", "        baz()"]

def read_callable(byte_offset, point):
    row, column = point
    if row >= len(src_lines) or column >= len(src_lines[row]):
        return None
    return src_lines[row][column:].encode('utf8')

tree = parser.parse(read_callable)
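To see what a read callable is expected to hand back, it can help to drive one by hand. The sketch below is a plain-Python stand-in for the consuming side, not the parser's actual loop: `drain` is a hypothetical helper that keeps calling the callable with an advancing byte offset and point, and, as a simplification, stops at the first empty or None result. Both callables from the examples above should reproduce the same bytes.

```python
def drain(read_callable):
    """Call read_callable repeatedly and join the returned chunks.

    Hypothetical stand-in for the consumer of a read callable; it tracks
    the byte offset and the (row, column) point, with column in bytes.
    """
    chunks = []
    byte_offset, row, column = 0, 0, 0
    while True:
        chunk = read_callable(byte_offset, (row, column))
        if not chunk:  # b"" or None: treated here as end of input
            break
        chunks.append(chunk)
        byte_offset += len(chunk)
        newlines = chunk.count(b"\n")
        if newlines:
            row += newlines
            column = len(chunk) - chunk.rfind(b"\n") - 1
        else:
            column += len(chunk)
    return b"".join(chunks)


src_lines = ["def foo():\n", "    if bar:\n", "        baz()"]
src = "".join(src_lines).encode("utf8")


def read_by_offset(byte_offset, point):
    # One byte at a time, keyed on the byte offset.
    return src[byte_offset:byte_offset + 1]


def read_by_point(byte_offset, point):
    # The rest of the current line, keyed on the (row, column) point.
    row, column = point
    if row >= len(src_lines) or column >= len(src_lines[row]):
        return None
    return src_lines[row][column:].encode("utf8")
```

Checking `drain(read_by_offset) == drain(read_by_point)` is a quick way to confirm that two callables describe the same source before handing either of them to the parser.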
Or with utf16 encoding:
tree = parser.parse(bytes("""
def foo():
    if bar:
        baz()
""", "utf16"), encoding="utf16")

# The same source read through a callable keyed on the byte offset,
# two bytes at a time
src = bytes("""
def foo():
    if bar:
        baz()
""", "utf16")

def read_callable(byte_offset, point):
    return src[byte_offset:byte_offset+2]

tree = parser.parse(read_callable, encoding="utf16")

# The same source read through a callable keyed on the point;
# the column is used as a byte index into the UTF-16 encoded line
src_lines = ["def foo():\n", "    if bar:\n", "        baz()"]

def read_callable(byte_offset, point):
    row, column = point
    if row >= len(src_lines) or column >= len(src_lines[row].encode("utf-16-le")):
        return None
    ret = src_lines[row].encode("utf-16-le")[column:]
    return ret

tree = parser.parse(read_callable, encoding="utf16")
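The UTF-16 examples above slice encoded bytes, so columns are byte counts rather than character counts. The sketch below shows that arithmetic using only Python's standard codecs; `utf16_byte_column` is a hypothetical helper name. It also shows a plain Python fact worth keeping in mind: the "utf16" codec prepends a byte order mark, while "utf-16-le" does not, which is why the two spellings give different lengths for the same text.

```python
def utf16_byte_column(line: str, char_index: int) -> int:
    """Byte offset, within the UTF-16 encoding of `line`, of `char_index`."""
    return len(line[:char_index].encode("utf-16-le"))


line = "    if bar:\n"
col = utf16_byte_column(line, 4)        # "if" follows four spaces
rest = line.encode("utf-16-le")[col:]   # bytes from that column onward

with_bom = "a".encode("utf16")          # byte order mark + one code unit
without_bom = "a".encode("utf-16-le")   # one code unit only
```

Characters outside the Basic Multilingual Plane take four bytes in UTF-16, so the helper encodes the prefix instead of multiplying the index by two.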
Inspect the resulting Tree:
root_node = tree.root_node
assert root_node.type == 'module'
assert root_node.start_point == (1, 0)
assert root_node.end_point == (3, 13)
function_node = root_node.children[0]
assert function_node.type == 'function_definition'
assert function_node.child_by_field_name('name').type == 'identifier'
function_name_node = function_node.children[1]
assert function_name_node.type == 'identifier'
assert function_name_node.start_point == (1, 4)
assert function_name_node.end_point == (1, 7)
# Parentheses let the adjacent string literals continue across lines
assert root_node.sexp() == (
    "(module "
    "(function_definition "
    "name: (identifier) "
    "parameters: (parameters) "
    "body: (block "
    "(if_statement "
    "condition: (identifier) "
    "consequence: (block "
    "(expression_statement (call "
    "function: (identifier) "
    "arguments: (argument_list))))))))"
)
If you need to traverse a large number of nodes efficiently, you can use a TreeCursor:
cursor = tree.walk()
assert cursor.node.type == 'module'
assert cursor.goto_first_child()
assert cursor.node.type == 'function_definition'
assert cursor.goto_first_child()
assert cursor.node.type == 'def'
# Returns `False` because the `def` node has no children
assert not cursor.goto_first_child()
assert cursor.goto_next_sibling()
assert cursor.node.type == 'identifier'
assert cursor.goto_next_sibling()
assert cursor.node.type == 'parameters'
assert cursor.goto_parent()
assert cursor.node.type == 'function_definition'
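The three cursor moves used above are enough for a full pre-order traversal. The sketch below is one way to write it; `walk_types` is a hypothetical helper, and `FakeNode` and `FakeCursor` are small stand-ins that mimic only `node`, `goto_first_child`, `goto_next_sibling` and `goto_parent` so the example runs without a compiled grammar. With the real library, the assumption is that the object returned by `tree.walk()` can be passed in directly.

```python
def walk_types(cursor):
    """Yield (depth, node.type) in pre-order using only cursor moves."""
    depth = 0
    while True:
        yield depth, cursor.node.type
        if cursor.goto_first_child():
            depth += 1
            continue
        # No children: move to the next sibling, climbing until one exists.
        while not cursor.goto_next_sibling():
            if not cursor.goto_parent():
                return  # back at the root with nothing left to visit
            depth -= 1


class FakeNode:
    """Stand-in node with just a type and a list of children."""

    def __init__(self, type, children=()):
        self.type = type
        self.children = list(children)


class FakeCursor:
    """Stand-in cursor over FakeNode trees (for illustration only)."""

    def __init__(self, root):
        self._path = [(root, 0)]  # stack of (node, index among siblings)

    @property
    def node(self):
        return self._path[-1][0]

    def goto_first_child(self):
        if not self.node.children:
            return False
        self._path.append((self.node.children[0], 0))
        return True

    def goto_next_sibling(self):
        if len(self._path) < 2:
            return False
        parent = self._path[-2][0]
        index = self._path[-1][1] + 1
        if index >= len(parent.children):
            return False
        self._path[-1] = (parent.children[index], index)
        return True

    def goto_parent(self):
        if len(self._path) < 2:
            return False
        self._path.pop()
        return True


demo_root = FakeNode("module", [
    FakeNode("function_definition", [
        FakeNode("def"), FakeNode("identifier"), FakeNode("parameters"),
    ]),
])
visited = list(walk_types(FakeCursor(demo_root)))
```

Because the traversal never materialises child lists, it keeps memory use flat regardless of how deep or wide the tree is.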
When a source file is edited, you can edit the syntax tree to keep it in sync with the source:
tree.edit(
    start_byte=5,
    old_end_byte=5,
    new_end_byte=5 + 2,
    start_point=(0, 5),
    old_end_point=(0, 5),
    new_end_point=(0, 5 + 2),
)
Then, when you're ready to incorporate the changes into a new syntax tree, you can call Parser.parse again, but pass in the old tree:
new_tree = parser.parse(new_source, tree)
This will run much faster than if you were parsing from scratch.
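The six keyword arguments passed to the edit call can be derived from the old source and the replacement text instead of being written by hand. The sketch below does that in plain Python; `point_at` and `edit_params` are hypothetical helpers, and points are computed as (row, column) with the column measured in bytes, which is an assumption that matches the utf8 examples in this document.

```python
def point_at(source: bytes, byte_offset: int):
    """Return the (row, column) point of byte_offset, column in bytes."""
    row = source.count(b"\n", 0, byte_offset)
    line_start = source.rfind(b"\n", 0, byte_offset) + 1  # 0 if no newline
    return (row, byte_offset - line_start)


def edit_params(old_source: bytes, start_byte: int, old_end_byte: int,
                replacement: bytes):
    """Replace old_source[start_byte:old_end_byte] and describe the edit.

    Returns the new source together with a dict of keyword arguments in
    the shape used by the edit call shown earlier.
    """
    new_source = old_source[:start_byte] + replacement + old_source[old_end_byte:]
    new_end_byte = start_byte + len(replacement)
    params = dict(
        start_byte=start_byte,
        old_end_byte=old_end_byte,
        new_end_byte=new_end_byte,
        start_point=point_at(old_source, start_byte),
        old_end_point=point_at(old_source, old_end_byte),
        new_end_point=point_at(new_source, new_end_byte),
    )
    return new_source, params


# Inserting two bytes at offset 5 reproduces the numbers used above.
new_source, params = edit_params(b"def foo():\n    pass\n", 5, 5, b"xy")
```

With these values in hand, the intended usage would be `tree.edit(**params)` followed by `parser.parse(new_source, tree)`.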
The Tree.get_changed_ranges method can be called on the old tree to return the list of ranges whose syntactic structure has been changed:
for changed_range in tree.get_changed_ranges(new_tree):
    print('Changed range:')
    print(f'  Start point {changed_range.start_point}')
    print(f'  Start byte {changed_range.start_byte}')
    print(f'  End point {changed_range.end_point}')
    print(f'  End byte {changed_range.end_byte}')
You can search for patterns in a syntax tree using a tree query:
query = PY_LANGUAGE.query("""
(function_definition
  name: (identifier) @function.def)

(call
  function: (identifier) @function.call)
""")
captures = query.captures(tree.root_node)
assert len(captures) == 2
assert captures[0][0] == function_name_node
assert captures[0][1] == "function.def"
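Since the asserts above treat each capture as a (node, name) pair, grouping the results by capture name takes only a few lines. This is a sketch under that assumption; `group_captures` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real nodes so the example runs without a compiled grammar.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_captures(captures):
    """Group (node, capture_name) pairs into {capture_name: [nodes]}."""
    grouped = defaultdict(list)
    for node, name in captures:
        grouped[name].append(node)
    return dict(grouped)


# Stand-in nodes; real captures would carry the library's node objects.
def_node = SimpleNamespace(type="identifier", start_point=(1, 4))
call_node = SimpleNamespace(type="identifier", start_point=(3, 8))

grouped = group_captures([
    (def_node, "function.def"),
    (call_node, "function.call"),
])
```

Grouping this way makes it straightforward to handle all definitions and all calls separately when a query returns many matches.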
The Query.captures() method takes optional start_point, end_point, start_byte and end_byte keyword arguments, which can be used to restrict the query's range. Only one of the ..._byte or ..._point pairs needs to be given to restrict the range. If all are omitted, the entire range of the passed node is used.
FAQs
Python bindings to the Tree-sitter parsing library
We found that abch-tree-sitter demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.