reliq-python
A Python module for the reliq library.
Requirements
Installation
pip install reliq
Import
from reliq import reliq
Usage
Code
import json
import re

from reliq import reliq

html = ""
with open('index.html', 'r') as f:
    html = f.read()

rq = reliq(html)  # parse html
# expressions can be compiled once and reused
expr = reliq.expr(r"""
    div .user; {
        a href; {
            .name @ | "%i",
            .link @ | "%(href)v"
        },
        .score.u span .score,
        .info dl; {
            .key dt | "%i",
            .value dd | "%i"
        } |,
        .achievements.a li class=b>"achievement-" | "%i\n"
    }
""")

users = []
links = []

for i in rq.filter(r'table; { tr, text@ iw>lisp }')[:-2]:
    # ignore comments and text nodes
    if i.type is not reliq.Type.tag:
        continue

    first_child = i[0]
    if first_child.child_count < 3 and first_child.name == "div" and first_child.starttag == '<div>':
        continue

    link = first_child[2].attribs['href']
    if re.match('^https://$', link):
        links.append(link)
        continue

    # skip nodes that have no <main> ancestor
    for j in i.ancestors():
        if j.name == "main":
            break
    else:
        continue

    # search() returns a str; the expression here is already compiled
    user = json.loads(i.search(expr))
    users.append(user)

try:
    reliq.search('p / /', '<p></p>')
except reliq.ScriptError:
    print("error")

print(rq[2].text_recursive)  # text of descendants
print(rq[2].text)  # text of children

# decode html entities
reliq.decode('loop &amp; &lt; &tdot; &#212;')

# execute the expression and convert the result to a dictionary
rq.json(r"""
    .files * #files; ( li )( span .head ); {
        .type i class child@ | "%(class)v" / sed "s/^flaticon-//",
        .name @ | "%Dt" / trim sed "s/ ([^)]* [a-zA-Z][Bb])$//",
        .size @ | "%t" / sed 's/.* \(([^)]* [a-zA-Z][Bb])\)$/\1/; s/,//g; /^[0-9].* [a-zA-Z][bB]$/!d' "E"
    } |
""")
Import
Everything is contained inside one class
from reliq import reliq
Initialization
The reliq object takes a single argument representing HTML. It can be str(), bytes(), Path() (the file is read as bytes), reliq() or None.
from pathlib import Path
rq = reliq('<p>Example</p>')
rq2 = reliq(Path('index.html'))
rq3 = reliq(None)
rq4 = reliq()
Types
A reliq object can have one of 5 types, which change the behaviour of its methods.
The type property of an object, e.g. rq.type, returns an instance of reliq.Type(Flag).
empty
Returned by reliq(None) or by a reliq.filter() that matches nothing; makes all methods return default values.
unknown
Similar to empty, but should never happen.
struct
Returned by successful initialization, e.g. reliq('<p>Example</p>').
list
Returned by a reliq.filter() that succeeds.
single
Returned by axis methods or by accessing the object like a list.
The type itself is a grouping of more specific types:
- tag
- comment
- textempty (text made only of whitespace)
- texterr (text where an html error occurred)
- text
- textall (grouping of text types)
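Since reliq.Type is described as a Flag, grouped types can be tested with `in` and combined with `|`. The sketch below uses a hypothetical stand-in class, NodeType, built on the standard enum.Flag; it is not reliq's actual definition, and the helper `matches` (also hypothetical) only mimics how a `type` argument could be interpreted.

```python
from enum import Flag, auto


class NodeType(Flag):
    """Hypothetical stand-in for reliq.Type; real member values may differ."""
    empty = auto()
    unknown = auto()
    struct = auto()
    list = auto()
    tag = auto()
    comment = auto()
    textempty = auto()
    texterr = auto()
    text = auto()
    # Groupings are unions of the more specific members (an assumption
    # based on the description above).
    textall = textempty | texterr | text
    single = tag | comment | textall


def matches(node_type, wanted=NodeType.tag):
    """Return True if node_type passes the filter; None matches every type."""
    return wanted is None or node_type in wanted


print(matches(NodeType.texterr, NodeType.comment | NodeType.texterr))
print(matches(NodeType.comment))
print(NodeType.textempty in NodeType.textall)
```

With a Flag, a membership test such as `NodeType.texterr in NodeType.textall` checks whether every bit of the left operand is set in the right one, which is why a grouping can be passed wherever a single type is accepted.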
get_data(raw=False) -> str|bytes
Returns the same HTML from which the object was compiled.
If the first argument is True or raw=True is passed, returns bytes.
data = Path('index.html').read_bytes()
rq = reliq(data)
x = rq[0][2][1][8]
x.get_data(True) is data
special methods
__bytes__ and __str__
Full string representation of current object
rq = reliq("""
<h1><b>H</b>1</h1>
<h2>N2</h2>
<h2>N3</h2>
""")
str(rq)
str(rq.filter('h2'))
str(rq[0])
str(reliq())
__getitem__
For the single type, indexing returns results from the children() axis; for other types, from the self() axis.
rq = reliq('<div><p>1</p> Text <b>H</b></div>')
first = rq[0]
first[1]
r = first.filter('( text@ * )( * ) child@')
r[1]
r[2] == first[1]
__len__
Number of objects accessible through __getitem__
properties of single
Calling these properties for types other than single
returns their default values.
lvl -> int
level in html structure
rlvl -> int
level in html structure, relative to parent
position -> int
position in html structure
rposition -> int
position in html structure, relative to parent
Calling some properties makes sense only for certain types.
tag
tag_count -> int
count of tags
text_count -> int
count of text
comment_count -> int
count of comments
desc_count -> int
count of descendants
attribsl -> int
number of attributes
attribs -> dict
dictionary of attributes
These return None only if called from the empty type. They also have _raw counterparts that return bytes, e.g. text_recursive_raw -> Optional[bytes], name_raw -> Optional[bytes].
insides -> Optional[str]
string containing contents inside tag or comment
name -> Optional[str]
tag name e.g. 'div'
starttag -> Optional[str]
head of the tag e.g. '<div class="user">'
endtag -> Optional[str]
tail of the tag e.g. '</div>'
endtag_strip -> Optional[str]
tail of the tag, stripped of < and >, e.g. '/div'
text -> Optional[str]
text of children
text_recursive -> Optional[str]
text of descendants
rq = reliq("""
<main>
<ul>
<a>
<li>L1</li>
</a>
<li>L2</li>
</ul>
</main>
""")
ul = rq[0][0]
a = ul[0]
li1 = a[0]
li2 = ul[1]
ul.name
ul.name_raw
ul.lvl
li1.lvl
ul.text
ul.text_recursive
a.insides
Comments can return either their full string representation or, through the insides property, only their contents.
c = reliq('<!-- Comment -->').self(type=None)[0]
c.insides
bytes(c)
str(c)
text
Text can only be converted to string
t = reliq('Example').self(type=None)[0]
str(t)
axes
Convert reliq objects into a list or a generator of single type objects.
If their first argument is set to True or gen=True is passed, a generator is returned; otherwise a list.
By default they filter node types to only reliq.Type.tag. This can be changed by setting the type argument, e.g. type=reliq.Type.comment|reliq.Type.texterr. If type is set to None, all types are matched.
If rel=True is passed, returned objects will be relative to the object from which they were matched.
rq = reliq("""
<!DOCTYPE html>
<head>
<title>Title</title>
</head>
<body>
<section>
<h1>Title</h1>
<p>A</p>
</section>
<h2>List</h2>
<ul>
<li>A</li>
<li>B</li>
<li>C</li>
</ul>
<section>
TEXT
</section>
</body>
""")
everything
everything() gets all elements in the structure, no matter the previous context.
for i in rq.everything(True):
    print(str(i))
self
self() gets the context itself: a single element for the single type, the list of elements for the list type, and elements with .lvl == 0 for the struct type.
By default the filtered type depends on the type of the object it was called on: single and list types are unfiltered; only the struct type enforces type=reliq.Type.tag.
rq.self()
rq.self(type=None)
rq.self(type=reliq.Type.tag|reliq.Type.comment)
ls = rq.filter('[:3] ( comment@ * )( text@ * )')
ls.self()
ls.self(type=reliq.Type.tag|reliq.Type.comment)
body = rq[1].self()
len(body.self())
body.self()[0].name
children
children() gets all nodes of the context that have a level relative to them equal to 1.
rq.children()
rq.filter('head, ul').children()
first_section = rq[1][0]
first_section.children()
descendants
descendants() gets all nodes of the context that have a level relative to them greater than or equal to 1.
rq.descendants()
rq.filter('[0] section').descendants()
rq[1][0].descendants()
full
full() gets all nodes of the context and all nodes below them (like calling self() and descendants() at once).
rq.full()
rq.filter('[0] section').full()
rq[1][0].full()
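The difference between children(), descendants() and full() comes down to relative level. The following is a minimal sketch of those three definitions over a flat list of (name, lvl) pairs in document order. It uses only tags from a document like the one above, and the list and helper functions are hypothetical stand-ins, not reliq's internals or API.

```python
# Tags in document order as (name, lvl) pairs; a hypothetical stand-in
# for a parsed structure (text and comment nodes are left out).
NODES = [
    ("head", 0), ("title", 1),
    ("body", 0), ("section", 1), ("h1", 2), ("p", 2),
    ("h2", 1), ("ul", 1), ("li", 2), ("li", 2), ("li", 2),
    ("section", 1),
]


def subtree(nodes, i):
    """Indexes of nodes below nodes[i]: the following run of deeper levels."""
    base = nodes[i][1]
    out = []
    for j in range(i + 1, len(nodes)):
        if nodes[j][1] <= base:
            break
        out.append(j)
    return out


def children(nodes, i):
    """Nodes whose level relative to nodes[i] equals 1."""
    return [j for j in subtree(nodes, i) if nodes[j][1] - nodes[i][1] == 1]


def descendants(nodes, i):
    """Nodes whose level relative to nodes[i] is greater than or equal to 1."""
    return subtree(nodes, i)


def full(nodes, i):
    """The node itself followed by its descendants."""
    return [i] + subtree(nodes, i)


def names(nodes, idxs):
    return [nodes[j][0] for j in idxs]


body = 2  # index of <body> in NODES
print(names(NODES, children(NODES, body)))
print(names(NODES, descendants(NODES, body)))
print(names(NODES, full(NODES, body)))
```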
parent
parent() gets the parent of context nodes. Doesn't work for the struct type.
rq.filter('li').parent()
rq[1][2][0].parent()
rq[0].parent()
rparent
rparent() behaves like parent() but returns the node to which the current object is relative. Doesn't work for the struct type.
It doesn't take the rel argument; returned objects are always relative.
ancestors
ancestors() gets ancestors of context nodes. Doesn't work for the struct type.
rq.filter('li').ancestors()
rq[1][2][0].ancestors()
rq[1][2][0].ancestors()[0].name == rq[1][2][0].parent()[0].name
rq[0].ancestors()
before
before() gets all nodes that have a lower .position property than context nodes. Doesn't work for the struct type.
rq.filter('[0] title, [1] section').before()
title = rq[0][0]
title.before()
second_section = rq[1][3]
second_section.before()
head = rq[0]
head.before()
preceding
preceding() is similar to before() but ignores ancestors. Doesn't work for the struct type.
rq.filter('[0] title, [1] section').preceding()
title = rq[0][0]
title.preceding()
second_section = rq[1][3]
second_section.preceding()
after
after() gets all nodes that have a higher .position property than context nodes. Doesn't work for the struct type.
rq.filter('h2, ul').after()
h2 = rq[1][1]
h2.after()
ul = rq[1][2]
ul.after()
third_section = rq[1][3]
third_section.after()
subsequent
subsequent() is similar to after() but ignores descendants. Doesn't work for the struct type.
rq.filter('h2, ul').subsequent()
h2 = rq[1][1]
h2.subsequent()
ul = rq[1][2]
ul.subsequent()
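The four positional axes differ only in what they exclude. As an illustration, here is a small sketch over a flat list of (name, lvl) pairs in document order, where a node's index plays the role of .position. The list and all helper functions are hypothetical stand-ins rather than reliq's API, only tags are counted, and the order of results in reliq itself may differ.

```python
# Tags in document order as (name, lvl) pairs; the index stands in for
# .position (a simplification: text and comment nodes are not counted).
NODES = [
    ("head", 0), ("title", 1),
    ("body", 0), ("section", 1), ("h1", 2), ("p", 2),
    ("h2", 1), ("ul", 1), ("li", 2), ("li", 2), ("li", 2),
    ("section", 1),
]


def ancestors(i):
    """Indexes of ancestors of NODES[i], nearest first."""
    out, lvl = [], NODES[i][1]
    for j in range(i - 1, -1, -1):
        if NODES[j][1] < lvl:
            out.append(j)
            lvl = NODES[j][1]
    return out


def descendants(i):
    """Indexes of the following run of nodes deeper than NODES[i]."""
    out = []
    for j in range(i + 1, len(NODES)):
        if NODES[j][1] <= NODES[i][1]:
            break
        out.append(j)
    return out


def before(i):
    return list(range(i))  # every lower position


def preceding(i):
    skip = set(ancestors(i))  # like before(), minus ancestors
    return [j for j in before(i) if j not in skip]


def after(i):
    return list(range(i + 1, len(NODES)))  # every higher position


def subsequent(i):
    skip = set(descendants(i))  # like after(), minus descendants
    return [j for j in after(i) if j not in skip]


def names(idxs):
    return [NODES[j][0] for j in idxs]


ul, last_section = 7, 11
print(names(after(ul)), names(subsequent(ul)))
print(names(before(last_section)), names(preceding(last_section)))
```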
siblings_preceding
siblings_preceding() gets nodes on the same level as context nodes but before them and limited to their parent. Doesn't work for the struct type.
If full=True is passed, descendants of siblings will also be matched.
rq.filter('ul, h2').siblings_preceding()
h2 = rq[1][1]
h2.siblings_preceding()
h2.siblings_preceding(full=True)
ul = rq[1][2]
ul.siblings_preceding()
ul.siblings_preceding(full=True)
siblings_subsequent
siblings_subsequent() gets nodes on the same level as context nodes but after them and limited to their parent. Doesn't work for the struct type.
If full=True is passed, descendants of siblings will also be matched.
rq.filter('ul, h2').siblings_subsequent()
h2 = rq[1][1]
h2.siblings_subsequent()
h2.siblings_subsequent(full=True)
ul = rq[1][2]
ul.siblings_subsequent()
ul.siblings_subsequent(full=True)
siblings
siblings() returns the merged output of siblings_preceding() and siblings_subsequent().
expr
reliq.expr is a class that compiles expressions. It accepts a single argument, which can be str(), bytes() or Path().
If a Path() argument is given, the file it points to is read with Path.read_bytes().
reliq.expr(r'table; { tr .name; li | "%(title)v\n", th }')
reliq.expr(rb'li')
reliq.expr(Path('expression.reliq'))
search
search() executes the expression given as the first argument and returns str(), or bytes if the second argument is True or raw=True.
The expression can be passed either as a compiled reliq.expr object or as a str(), bytes() or Path() that will be compiled inside the function.
rq = reliq('<span class=name data-user-id=1282>User</span><p>Title: N1ase</p>')
rq.search(r'p')
rq.search(r'p', True)
rq.search(r'p', raw=True)
rq.search(r"""
span .name; {
.id.u @ | "%(data-user-id)v",
.name @ | "%t"
},
.title p | "%i" sed "s/^Title: //"
""",True)
rq.search(Path('expression.reliq'))
json
Same as search() but returns dict().
filter
filter() executes the expression given as the first argument and returns a reliq object of list type, or of empty type if nothing has been found.
If the second argument is True or independent=True, the returned object will be completely independent from the one the function was called on: a new HTML string representation is created, the structure is copied and shifted to the new string, and levels also change.
The expression can be passed either as a compiled reliq.expr object or as a str(), bytes() or Path() that will be compiled inside the function.
Any field, formatting or string conversion in the expression is ignored; only the objects used in them are returned.
rq = reliq('<span class=name data-user-id=1282>User</span><p>Title: N1ase</p>')
rq.filter(r'p').self()
rq.filter(r'p').type
rq.filter(r'p').get_data()
rq.filter(r'p',True).get_data()
rq.filter(r'nothing').type
rq.filter(r"""
span .name; {
.id.u @ | "%(data-user-id)v",
.name @ | "%t"
},
.title p | "%i" sed "s/^Title: //"
""")
rq.filter(Path('expression.reliq'))
Encoding and decoding html entities
decode() decodes HTML entities in its first argument, a str() or bytes(), and returns str(), or bytes() if the second argument is True or raw=True.
By default &nbsp; is translated to a space; this can be changed by setting no_nbsp=False.
encode() does the opposite of decode() in the same fashion.
By default only special characters are encoded, i.e. <, >, ", ', &. If full=True is set, everything possible will be converted to HTML entities (a quite slow approach).
reliq.decode(r"text &amp; &lt; &tdot; &#212;")
reliq.decode(r"text &amp; &lt; &tdot; &#212;",True)
reliq.decode(r"text &amp; &lt; &tdot; &#212;",raw=True)
reliq.decode('ex&nbsp;t')
reliq.decode('ex&nbsp;t',no_nbsp=False)
reliq.decode('ex&nbsp;t',True,no_nbsp=False)
reliq.encode("<p>li & \t 'seq' \n </p>")
reliq.encode("<p>li & \t 'seq' \n </p>",True)
reliq.encode("<p>li & \t 'seq' \n </p>",raw=True)
reliq.encode("<p>li & \t 'seq' \n </p>",full=True)
reliq.encode("<p>li & \t 'seq' \n </p>",True,full=True)
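For cross-checking results without reliq, the described behaviour can be approximated with Python's standard html module. This is only a rough sketch: decode_like is a hypothetical helper, not part of reliq, and reliq's own entity tables and edge cases may differ from the standard library's.

```python
import html


def decode_like(text: str, no_nbsp: bool = True) -> str:
    """Approximate decode(): unescape entities, optionally mapping
    the non-breaking space (U+00A0) to a plain space."""
    decoded = html.unescape(text)
    if no_nbsp:
        # Note: this also replaces U+00A0 characters that were already
        # present in the input, which may not match reliq's behaviour.
        decoded = decoded.replace("\u00a0", " ")
    return decoded


print(repr(decode_like("ex&nbsp;t")))
print(repr(decode_like("ex&nbsp;t", no_nbsp=False)))
print(decode_like("text &amp; &lt; &#212;"))

# html.escape() covers the same special characters listed above:
# <, >, ", ' and &.
print(html.escape("<p>li & 'seq'</p>"))
```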
Errors
All errors are instances of reliq.Error.
reliq.SystemError is raised when the kernel fails (you should assume it doesn't happen).
reliq.HtmlError is raised when the HTML structure exceeds limits.
reliq.ScriptError is raised when an incorrect script is compiled.
try:
    reliq('<div>'*8193)
except reliq.HtmlError:
    print('html depth limit exceeded')
try:
    reliq.expr('| |')
except reliq.ScriptError:
    print('incorrect expression')
Relativity
list and single type objects also store a pointer to the node that the object is relative to in context, i.e. rq.filter(r'body; nav') will return nav objects that were found in body tags. The nav objects might not be direct children of body tags, but because of relativity their relation is not lost.
reliq.filter() always keeps the relativity.
By default axis functions don't change relativity unless rel=True is passed.
rq = reliq("""
<body>
<nav>
<ul>
<li> A </li>
<li> B </li>
<li> C </li>
</ul>
</nav>
</body>
""")
li = rq[0][0][0][1]
li_self = rq.filter('li i@w>"B"')[0]
li_rel = rq.filter('nav; li i@w>"B"')[0]
li.lvl
li_rel.lvl
li.rlvl
li_rel.rlvl
li.position
li_rel.position
li.rposition
li_rel.rposition
nav = rq[0][0]
for i in nav.descendants(rel=True):
    if i.rlvl == 2 and i.name == 'li':
        print(i.lvl,i.rlvl)
        break
nav_rel = li_rel.rparent()[0]
nav_rel.rlvl
nav_rel.rposition
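One way to picture relativity is as an offset from the node an object is relative to. The sketch below assumes that rlvl and rposition are plain differences from that node's lvl and position. That is an assumption made for illustration, not something confirmed here, and the Node class and its anchor field are hypothetical stand-ins rather than reliq objects. Positions count tags only, which may not match reliq's numbering.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a single-type object."""
    name: str
    lvl: int
    position: int
    anchor: Optional["Node"] = None  # node this one is relative to, if any

    @property
    def rlvl(self) -> int:
        # Assumed: relative level is the difference from the anchor's level.
        return self.lvl - (self.anchor.lvl if self.anchor else 0)

    @property
    def rposition(self) -> int:
        # Assumed: relative position is the difference from the anchor's position.
        return self.position - (self.anchor.position if self.anchor else 0)


# <body><nav><ul><li>A</li><li>B</li><li>C</li></ul></nav></body>, tags only
nav = Node("nav", lvl=1, position=1)
li = Node("li", lvl=3, position=4)                  # not relative to anything
li_rel = Node("li", lvl=3, position=4, anchor=nav)  # relative to nav

print(li.lvl, li.rlvl, li.position, li.rposition)
print(li_rel.lvl, li_rel.rlvl, li_rel.position, li_rel.rposition)
```

Under this model the absolute properties (lvl, position) are identical for both objects, and only the relative ones change with the anchor.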
Projects using reliq in python