This package takes subtitle VTT files (Video Text Track files) and extracts the pieces of news from the whole newscast inside the file. The news items are stored in a Tree structure with useful NLP features inside. The user can specify their own segmentation algorithm, but defaults are provided.
Author: A.Palomo-Alonso (a.palomo@uah.es)
Contributors: D.Casillas-Pérez, S.Jiménez-Fernández, A.Portilla-Figueras, S.Salcedo-Sanz.
Universidad de Alcalá.
Escuela Politécnica Superior.
Departamento de Teoría De la Señal y Comunicaciones (TDSC).
Cátedra ISDEFE.
- `NewSegmentation`: abstract class for custom algorithms.
- `Segmentation`: class for the default modules.
- `plot_matrix()`: method for plotting the generated matrix.
- `where_is()`: method for finding pieces of news.
- `gtreader()`: reads reference trees, in a specific format, for evaluation.
- `Tree` and `Leaf` structures.
- `PBMM` and `FB-BCM` algorithms.
- `TDM`, `DBM` and `SDM` implemented.
- `GPA` implemented inside `SDM`.
- `save()` implemented in the `Segmentation` class.
- `load3s` implemented for reading trees from files.
- `Segmentation.evaluate()` can now take a path as a parameter!
- `Segmentation.evaluate()` now takes `integrity_validation=True` as a default parameter for integrity validation. NOTE: if your custom algorithm removes sentences from the original text, you should pass `integrity_validation=False`, since the validation checks that every sentence is present in each tree.
- `Segmentation` class takes a cache file as a new parameter: `Segmentation(cache_file='./myjson.json')`. This speeds up the architecture when the sentences sent to the LCM are the same. For instance, when testing parameters on the same database the process is around 1000% faster.
- Fixed: `'.'` not inserted when constructing the payload from leafs.
- Fixed: `.TXT` input files and cache.
- Fixed: `evaluate(<gt>, show=True)` where the correct segmentation and the performed segmentation switched places in the plot representation.
- `Segmentation` class.
- `/experiments` folder.
- `/tests` folder.

The whole architecture and algorithms are described in depth in this paper or in this master thesis.
The architecture takes advantage of three main features in order to perform news segmentation:
This architecture works with a correlation matrix formed by the semantic correlation between each pair of sentences in the news broadcast. Each module modifies the correlation matrix in order to apply the temporal / spatial features reflected in the matrix. The algorithms must be able to identify each piece of news inside the matrix. Three differentiated modules make up the architecture:
The user can implement their own algorithms depending on their application.
The results are stored in a Tree structure with different fields representing different features of the piece of news.
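To make the correlation-matrix idea concrete, here is a toy sketch with made-up embedding vectors; a real specific language model (e.g. `sentence_transformers`, as in the requirements) would produce the vectors, and this is not the package's actual computation:

```python
import numpy as np

# Toy stand-in for sentence embeddings (one row per sentence of the broadcast).
embeddings = np.array([
    [1.0, 0.1, 0.0],   # sentence 0
    [0.9, 0.2, 0.1],   # sentence 1 (similar to 0 -> same piece of news)
    [0.0, 0.1, 1.0],   # sentence 2 (dissimilar -> likely a boundary)
])

# Semantic correlation matrix: cosine similarity between every pair of sentences.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit = embeddings / norms
R = unit @ unit.T
print(np.round(R, 2))
```

Blocks of high values around the diagonal correspond to pieces of news; the modules of the architecture reshape this matrix before the segmentation algorithm extracts those blocks.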
First, install the Python package. After this, you can use your VTT files to get the news. Any other type of file can be considered, but the user must implement their own database transformer according to the file and language used. Spanish news segmentation is the default model.
You can install the package via pip:
pip install newsegmentation
If any error occurs, try installing the requirements before installing the package:
numpy
matplotlib
googletrans == 4.0.0rc1
sentence_transformers >= 2.2.0
sklearn
nltk
In this demo, we extract the news inside the first 5 minutes of the VTT
file:
$ python
>>> import newsegmentation as ns
>>> myNews = ns.Segmentation(r'./1.vtt')
>>> print(myNews)
NewsSegmentation object: 8 news classified.
>>> myNews.info()
News segmentation package:
--------------------------------------------
FAST USAGE:
--------------------------------------------
PATH_TO_MY_FILE = <PATH>
import newsegmentation as ns
news = ns.NewsSegmentation(PATH_TO_MY_FILE)
for pon in news:
print(pon)
--------------------------------------------
>>> myNews.about()
Institution:
------------------------------------------------------
Universidad de Alcalá.
Escuela Politécnica Superior.
Departamento de Teoría De la Señal y Comunicaciones.
Cátedra ISDEFE.
------------------------------------------------------
Author: Alberto Palomo Alonso
------------------------------------------------------
>>> for pieceOfNews in myNews:
...     print(pieceOfNews)
No hay descanso. Desde hace más de 24 horas se trabaja sin tregua para encontrar a Julen. El niño de 2 años se cayó en un pozo en Totalán, en Málaga. Las horas pasan, los equipos de rescate luchan contrarreloj y buscan nuevas opciones en un terreno escarpado y con riesgo de derrumbes bajo tierra. Buenas noches. Arrancamos este Telediario, allí, en el lugar del rescate. ¿Cuáles son las opciones para encontrar a Julen? Se trabaja en 3 frentes retirar la arena que está taponando el pozo de prospección. Excavar en 2 pozo, y abrir en el lateral de la montaña
El objetivo rescatar al pequeño. El proyecto de presupuestos llega al Congreso. Son las cuentas con más gasto público desde 2010 Destacan más partidas para programas sociales, contra la pobreza infantil o la dependencia, y también el aumento de inversiones en Cataluña. El gobierno necesita entre otros el apoyo de los independentistas catalanes que por ahora mantienen el NO a los presupuestos, aunque desde el ejecutivo nacional se escuchan voces más optimistas
La familia de Laura Sanz Nombela, fallecida en París por una explosión de gas espera poder repatriar su cuerpo este próximo miércoles. Hemos hablado con su padre, que está en Francia junto a su yerno y nos ha contado que se sintieron abandonados en las primeras horas tras el accidente. La guardia civil busca en una zona de grutas volcánicas de difícil acceso el cuerpo de la joven desaparecida en Lanzarote, Romina Celeste. Su marido está detenido en relación con su muerte aunque él defiende que no la mató, que solo discutieron y que luego se la encontró muerta la noche de Año Nuevo
Dormir poco hace que suba hasta un 27 por ciento el riesgo de enfermedades cardiovasculares
Es la conclusión de un estudio que ha realizado durante 10 años el Centro Nacional para estas dolencias
Y una noticia de esta misma tarde de la que estamos muy pendientes: Un tren ha descarrilado esta tarde cerca de Torrijos en Toledo sin causar heridos. Había salido de Cáceres con dirección a Madrid. Los 33 pasajeros han sido trasladados a la capital en otro tren. La circulación en la vía entre Madrid y Extremadura está interrumpida. Renfe ha organizado un transporte alternativo en autobús para los afectados
A 15 días de la gran gala de los Goya hoy se ha entregado ya el primer premio. La cita es el próximo 2 de febrero en Sevilla, pero hoy, aquí en Madrid, en el Teatro Real gran fiesta de los denominados a los Premios Goya. Solo uno de ellos se llevará hoy su estatuilla. Chicho Ibáñez Serrador consigue el Premio Goya de Honor por toda una vida dedicada al cine de terror
Y en los deportes Nadal gana en Australia, Sergio
>>> myNews.plotmtx()
You can also find information inside the news using the `whereis()` method:
>>> myNews.whereis('Nadal')
[7]
>>> myNews.whereis('2')
[0, 1, 3, 6]
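Conceptually, `whereis()` behaves like a substring search over the pieces of news that returns the matching indices. The `where_is` function below is a toy re-implementation for illustration only, not the package's code:

```python
def where_is(pieces, query):
    # Return the indices of the news pieces whose text contains the query.
    return [i for i, text in enumerate(pieces) if query in text]

# Hypothetical pieces of news, as plain strings.
pieces = ["Nadal gana en Australia",
          "Presupuestos con más gasto público",
          "Rescate en Málaga"]
print(where_is(pieces, "Nadal"))  # [0]
```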
If you can create a tree from any ground-truth database, this package also has a method for evaluation:
First, you have to import a custom ground truth / golden data tree with `gtreader()`:
>>> from newsegmentation import gtreader
>>> myGt = gtreader('path.txt')
Then, to evaluate the news against the reference, use `evaluate(ref, show=True)`; the `show=True` argument plots some graphics about the evaluation:
>>> myNews.evaluate(myGt, show=True)
This package defines a data structure called news trees; this format is read and written by the code via parsers:
>>> save_file = './testing' # or save_file = './testing.3s'
>>> myNews.save(save_file)
>>> sameNews = ns.load3s(save_file)
>>> results = myNews.evaluate(sameNews)
>>> print(results)
{'Precision': 1.0, 'Recall': 1.0, 'F1': 1.0, 'WD': 0.0, 'Pk': 0.0}
This saves the generated trees (not the `Segmentation` instance) into the `.3s` file given as a parameter.
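The `WD` and `Pk` values in the results dictionary are standard boundary-based segmentation metrics (WindowDiff and Pk). As an aside, here is one common formulation of WindowDiff written from scratch for illustration; the package's own computation may differ in details such as the window-size convention:

```python
def window_diff(ref, hyp, k):
    # '1' marks a segment boundary after a sentence, '0' no boundary.
    # Slide a window of size k over both boundary strings and count the
    # windows where the number of boundaries disagrees.
    windows = len(ref) - k + 1
    errors = sum(
        ref[i:i + k].count('1') != hyp[i:i + k].count('1')
        for i in range(windows)
    )
    return errors / windows

print(window_diff("0100100", "0100100", 2))  # 0.0 (identical segmentations)
print(window_diff("0100100", "0101000", 2))  # 2 of 6 windows disagree
```

A perfect segmentation therefore yields `WD = 0.0`, matching the `1.0`/`0.0` results printed above for a tree evaluated against itself.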
If you want to run the same database several times (for algorithm design, parameter testing or other reasons), you should use the cache serialization system. This system stores into a `.json` file all the embeddings generated in the SLM.
If any sentence is repeated, the system will not compute its embeddings again. All sentences computed in the SLM are stored into the `cache_file` if provided. Here is an example of the speed-up process:
>>> import time
>>> import newsegmentation as ns
>>>
>>> myDatabase = ['./1.vtt', './2.vtt', './3.vtt']
>>> cache_file = './cache.json'
>>> lcm_parameters = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
>>> elapsed_time = list()
>>>
>>> for parameter in lcm_parameters:
...     initial_time = time.perf_counter()
...     for news in myDatabase:
...         myNews = ns.Segmentation(news, lcm=(parameter,), cache_file=cache_file)
...     elapsed_time.append(time.perf_counter() - initial_time)
...
>>> for i, seconds in enumerate(elapsed_time):
...     print(f'{i + 1} iteration: {seconds} seconds.')
1 iteration: 89.23 seconds.
2 iteration: 9.28 seconds.
3 iteration: 8.91 seconds.
4 iteration: 12.2 seconds.
5 iteration: 7.22 seconds.
6 iteration: 13.9 seconds.
If any further speed-up is needed: the model reads the original `.VTT` files and stores them as temporary `.TXT` files. If the model is reading these files continuously, it is better to convert the `.VTT` files to `.TXT` once, store them, and give the model the `.TXT` files instead.
This skips the first preprocessing step in every iteration. You can do something similar to this:
>>> import newsegmentation as ns
>>> in_files = ['./1.vtt', './2.vtt', './3.vtt', './4.vtt', './5.vtt']
>>> txt_files = [ns.default_dbt(vtt_file) for vtt_file in in_files]
>>> times = 200
>>> for i in range(times):
...     for txt_file in txt_files:
...         myNews = ns.Segmentation(txt_file)
This method slightly speeds up the process, and it is only worthwhile if the same file is going to be transformed more than once.
To implement custom algorithms, subclass the abstract class `NewSegmentation`, using this demo as a template:
import newsegmentation as ns

class MySegmentation(ns.NewsSegmentation):
    @staticmethod
    def _spatial_manager(r, param):
        # return ns.default_sdm(r, param)
        return myown_sdm(r, param)

    @staticmethod
    def _specific_language_model(s):
        # return ns.default_slm(s)
        return myown_slm(s)

    @staticmethod
    def _later_correlation_manager(lm, s, t, param):
        # return ns.default_lcm(lm, s, t, param)
        return myown_lcm(lm, s, t, param)

    @staticmethod
    def _database_transformation(path, op):
        # return ns.default_dbt(path, op)
        return myown_dbt(path, op)
Note that ns.default_xxx
is the default manager for the architecture and can be replaced by your own functions.
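For instance, a custom spatial manager could enforce temporal locality by zeroing correlations between sentences that are far apart in the broadcast. The `myown_sdm` below is a hypothetical sketch of such a replacement operating on the correlation matrix, not a drop-in equivalent of the package's default SDM:

```python
import numpy as np

def myown_sdm(r, param):
    # Hypothetical spatial manager: keep correlations only between sentences
    # whose positions differ by at most `param`, zeroing the rest.
    out = np.array(r, dtype=float)
    n = out.shape[0]
    for i in range(n):
        for j in range(n):
            if abs(i - j) > param:
                out[i, j] = 0.0
    return out

# On an all-ones 4x4 matrix, only a band of width `param` survives.
print(myown_sdm(np.ones((4, 4)), 1))
```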
Take into account the following constraints before implementing your own module managers:
Comparing two different algorithms inside the architecture: LGA is a kernel-based algorithm with cellular automata techniques. The PBMM algorithm is the default algorithm and has better F1-score performance and reliability. This is tested over a Spanish news broadcast database with 10 files.
@<not_available_yet>{palomo2022alonso,
title={A Flexible Architecture using Temporal, Spatial and Semantic
Correlation-based algorithms for Story Segmentation of Broadcast News},
author={A.Palomo-Alonso, D.Casillas-Pérez, S.Jiménez-Fernández, A.Portilla-Figueras, S.Salcedo-Sanz},
year={2022}
}