FLAP is an open-source tool for linking free-text addresses to Ordnance Survey Unique Property Reference Numbers (OS UPRN). You need a licence for OS UPRN and must download the address premium product to use FLAP. FLAP can be used at scale with a few lines of syntax.
Full deployment resources can be found in the deploy directory of this repository.
Please see:
- deploy/linux for using the tool on a Linux server
- deploy/docker for running FLAP from a Docker container
- deploy/posit_setup_public for launching FLAP jobs on POSIT Workbench

Use flap.match for matching addresses to the database
from flap import match
input_csv = '[PATH_TO_INPUT_CSV_FILE]'
db_path = '[PATH_TO_THE_DB]'
results = match(
input_csv=input_csv,
db_path=db_path
)
Optional arguments that can be passed:
output_file_path : str, default None
Path for saving the output csv file, containing ['input_id', 'input_address', 'uprn', 'score']. If None, results are not saved
raw_output_path : str, default None
Path for saving the batched raw output files. If None, results are not saved
in_progress_log_path : str, default None
Path for files indicating that a batch is being processed
max_log_interval : str, default 4800
The interval within which the programme assumes some process is still actively working on a batch
batch_size : int, default 10000
Size of each batch
max_workers : int, default None
Number of processes. If None, the max cpu available is determined by flap.utils.cpu_count.available_cpu_count()
in_memory_db : bool, default False
Whether an in-memory SQLite database is used. If True, a temp database is created in shared memory cache from pre-built csv files
classifier_model_path : str, default None
The path to the pretrained sklearn classifier model. If None, the model is loaded from 'flap.file/model/*.clf'
max_beam_width : int, default 200
The max number of rows to be considered from the UPRN database
score_threshold : float, default 0.3
The min score for early stop of matching

To start matching a table of addresses to UPRN from scratch, the input data should be a .csv file with the following format. Essentially, there should be an input_id column, which you can use to join the address to other tables, and an input_address column, which is a free-text address. This input is usually concatenated from multiple fields in your raw data.
The function in this scenario is flap.match
input_id | input_address |
---|---|
xxxxxx1 | The Queens Medical Research Institute, 47 Little France Cres, Edinburgh EH16 4TJ |
xxxxxx2 | Queen Elizabeth University Hospital, 1345 Govan Rd, Glasgow G51 4TF |
xxxxxx3 | 47 Little France Crescent, Edinburgh EH16 4TJ |
xxxxxx4 | 1345 Govan Rd, Glasgow G51 4TF |
... | ... |
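As a minimal sketch of how such an input file might be assembled, the snippet below concatenates separate address fields into a single input_address and writes the two required columns. The raw field names (id, line1, town, postcode) and the helper names are hypothetical and only illustrate the idea; they are not part of FLAP.

```python
import csv
import io

# Hypothetical raw records; the field names are illustrative, not a FLAP requirement.
raw_records = [
    {"id": "xxxxxx3", "line1": "47 Little France Crescent", "town": "Edinburgh", "postcode": "EH16 4TJ"},
    {"id": "xxxxxx4", "line1": "1345 Govan Rd", "town": "Glasgow", "postcode": "G51 4TF"},
]


def build_input_rows(records):
    """Concatenate non-empty address fields into one free-text address per record."""
    rows = []
    for rec in records:
        parts = [rec.get(field, "").strip() for field in ("line1", "town")]
        address = ", ".join(p for p in parts if p)
        postcode = rec.get("postcode", "").strip()
        if postcode:
            address = f"{address} {postcode}".strip()
        rows.append({"input_id": rec["id"], "input_address": address})
    return rows


def write_input_csv(rows, handle):
    """Write the two columns FLAP expects: input_id and input_address."""
    writer = csv.DictWriter(handle, fieldnames=["input_id", "input_address"])
    writer.writeheader()
    writer.writerows(rows)


rows = build_input_rows(raw_records)
buffer = io.StringIO()  # swap for open('input.csv', 'w', newline='') to write a file
write_input_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping input_id unchanged from the raw data is what lets you join the matched UPRNs back to your other tables afterwards.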
In many scenarios, UPRN suggestions already exist for some of the addresses. For example, the data may have been processed with other tools like CURL or ASSIGN, or some manual matching may have been done. FLAP can use these suggestions to speed up the matching. First, FLAP scores each suggested UPRN match. If the score passes a threshold, the suggested UPRN is accepted. If not, the address is matched as usual, together with the addresses that have no UPRN suggestion.
In this scenario, the input should look like the table below, and the function to use is flap.score_and_match
input_id | input_address | uprn |
---|---|---|
xxxxxx1 | The Queens Medical Research Institute, 47 Little France Cres, Edinburgh EH16 4TJ | 906426044 |
xxxxxx2 | Queen Elizabeth University Hospital, 1345 Govan Rd, Glasgow G51 4TF | |
xxxxxx3 | 47 Little France Crescent, Edinburgh EH16 4TJ | |
xxxxxx4 | 1345 Govan Rd, Glasgow G51 4TF | |
... | ... | ... |
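The accept-or-fall-back flow described above can be sketched as follows. This is not FLAP's implementation: score_then_match and the two stand-in scorer functions are hypothetical, the 0.5 threshold is taken from the worked example in this document, and whether the comparison is inclusive is an assumption.

```python
def score_then_match(rows, score_suggestion, match_from_scratch, threshold=0.5):
    """Accept a suggested UPRN when its score passes the threshold, else match from scratch."""
    results = []
    for row in rows:
        suggested = row.get("uprn") or None  # empty string means no suggestion
        eval_score = None
        if suggested is not None:
            eval_score = score_suggestion(row["input_address"], suggested)
            if eval_score >= threshold:  # assumption: inclusive comparison
                results.append({
                    "input_id": row["input_id"],
                    "flap_eval_score": eval_score,
                    "flap_match_score": eval_score,
                    "flap_uprn": suggested,
                })
                continue
        # No suggestion, or the suggestion was rejected: match as usual.
        uprn, match_score = match_from_scratch(row["input_address"])
        results.append({
            "input_id": row["input_id"],
            "flap_eval_score": eval_score,
            "flap_match_score": match_score,
            "flap_uprn": uprn,
        })
    return results


# Stand-in scorers returning fixed values; real scoring is done by FLAP.
def fake_score_suggestion(address, uprn):
    return 0.634


def fake_match_from_scratch(address):
    return "906700404351", 0.6225


rows = [
    {"input_id": "xxxxxx1",
     "input_address": "The Queens Medical Research Institute, 47 Little France Cres, Edinburgh EH16 4TJ",
     "uprn": "906426044"},
    {"input_id": "xxxxxx4", "input_address": "1345 Govan Rd, Glasgow G51 4TF", "uprn": ""},
]
results = score_then_match(rows, fake_score_suggestion, fake_match_from_scratch)
for result in results:
    print(result)
```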
I have divided the result table in two parts just for better reading experience. It would be in one table for FLAP output.
input_id | uprn | flap_eval_score | flap_match_score | flap_uprn |
---|---|---|---|---|
xxxxxxx1 | 906426044 | 0.6341964285714285 | 0.6341964285714285 | 906426044 |
xxxxxxx2 | | | 0.8225 | 906700404351 |
xxxxxxx3 | | | 0.46 | 906426044 |
xxxxxxx4 | | | 0.6225 | 906700404351 |
input_id | input_address | uprn_row |
---|---|---|
xxxxxxx1 | The Queens Medical Research Institute, 47 Little France Cres, Edinburgh EH16 4TJ | UNIVERSITY OF EDINBURGH,,,THE QUEENS MEDICAL RESEARCH INSTITUTE,47,,LITTLE FRANCE CRESCENT,,EDINBURGH BIOQUARTER,EDINBURGH,EH16 4TJ |
xxxxxxx2 | QUEEN ELIZABETH UNIVERSITY HOSPITAL 1345 Govan Rd, Glasgow G51 4TF | QUEEN ELIZABETH UNIVERSITY HOSPITAL,,,,1345,,GOVAN ROAD,,,GLASGOW,G51 4TF |
xxxxxxx3 | University of Edinburgh, 47 Little France Crescent, Edinburgh EH16 4TJ | UNIVERSITY OF EDINBURGH,,,THE QUEENS MEDICAL RESEARCH INSTITUTE,47,,LITTLE FRANCE CRESCENT,,EDINBURGH BIOQUARTER,EDINBURGH,EH16 4TJ |
xxxxxxx4 | 1345 Govan Rd, Glasgow G51 4TF | QUEEN ELIZABETH UNIVERSITY HOSPITAL,,,,1345,,GOVAN ROAD,,,GLASGOW,G51 4TF |
- flap_match_score is the confidence level of the match. In general, a score >0.5 indicates a good match that you do not need to review, unless the input has a tenement pattern matching the regex '\d+F\d+', like '2F3'. Matches with scores between 0.3 and 0.5 are a mix of low-confidence correct matches and mismatches. The confidence score is not a probability (it is not calibrated), so a score of 0.6 does NOT mean the match is correct 60% of the time.
- flap_uprn is the UPRN matched
- uprn_row is the comma-delimited values from the AddressPremium database
- flap_eval_score is the score from scoring the suggested match

In general, matches with a score over 0.5 are almost always good matches, with the caveat that addresses with patterns matching the regex '\d+F\d+', like 2F3, might not be correct because the guess of how many flats there are per level is not always right.
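The review guidance above can be turned into a small triage helper. The sketch below is only one way to apply it: the function name and the returned labels are hypothetical, and treating scores below 0.3 as a separate "low_score" bucket is an assumption, since the text does not say how to handle them.

```python
import re

# Tenement pattern from the guidance above, e.g. '2F3'.
TENEMENT_PATTERN = re.compile(r"\d+F\d+")


def triage(input_address, flap_match_score):
    """Suggest a review action for one match based on its score and address pattern."""
    if flap_match_score > 0.5:
        if TENEMENT_PATTERN.search(input_address):
            # High score, but the flat position may have been guessed wrongly.
            return "review_tenement"
        return "accept"
    if flap_match_score >= 0.3:
        # Mix of low-confidence correct matches and mismatches.
        return "review"
    return "low_score"  # assumption: not covered by the guidance


print(triage("1345 Govan Rd, Glasgow G51 4TF", 0.6225))
print(triage("2F3 10 Example Street, Edinburgh", 0.7))
print(triage("47 Little France Crescent, Edinburgh EH16 4TJ", 0.46))
```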
input_id | interpretation |
---|---|
xxxxxxx1 | The input has a suggested UPRN, which is scored. The score is 0.63, which passes the threshold of 0.5, so the suggestion is accepted. The match is correct. FLAP has dealt with the abbreviation in the street name Cres and the missing ORGANISATION_NAME UNIVERSITY OF EDINBURGH |
xxxxxxx2 | The input has good quality and is matched to the correct UPRN. There is an abbreviation in the street name Rd |
xxxxxxx3 | The input has missing BUILDING_NAME . The matching is correct but the score is only 0.46. It is a False Negative. |
xxxxxxx4 | The input has missing ORGANISATION_NAME and an abbreviation Rd. The matching is correct with a score of 0.6225. Note that the score is lower than input no. xxxxxxx2 , but still passes the threshold |
The flap.match function is the top-level API for matching an input table of addresses from scratch to UPRN.
from flap import match
input_csv = '<path_to_your_input_csv_file>'
db_path = '<path_to_the_built_db>'
results = match(
    input_csv, db_path
)
print(results)
The flap.score_and_match function is the top-level API for scoring the suggested UPRN matches and then matching the addresses whose suggestion does not pass the threshold, together with those that have no UPRN suggestion.
from flap import score_and_match
input_csv = '<path_to_your_input_csv_file>'
db_path = '<path_to_the_built_db>'
# Call score_and_match (not match) so that the suggested UPRNs are scored first
results = score_and_match(
    input_csv, db_path
)
print(results)
You may use the Python function help to see the detailed documentation on the use of optional arguments.
from flap import match, score_and_match
# help() prints the documentation itself, so no print() is needed
help(match)
help(score_and_match)
The database operations are handled by the flap.database.sql.SqlDB class
from flap.database.sql import SqlDB
db_path = '<path_to_the_built_db>'
sql_db = SqlDB(db_path)
print(sql_db.get_table_names())
print(sql_db.get_columns_of_table('indexed'))
print(sql_db.sql_query('select * from indexed limit 2'))
The flap.matcher.sql_matcher.SqlMatcher class handles matching of addresses to UPRN. It takes an obligatory argument, an instance of the flap.database.sql.SqlDB class, which specifies the database to match to.
from flap.matcher.sql_matcher import SqlMatcher
from flap.database.sql import SqlDB
db_path = '<path_to_the_built_db>'
sql_db = SqlDB(db_path)
matcher = SqlMatcher(sql_db)
address = '1345 GOVAN ROAD, GLASGOW G51 4TF'
match_results = matcher.match(address)
print(match_results)
The flap.parser.rule_parser_fast.RuleParserFast class handles parsing information from address strings. It takes an obligatory argument, an instance of the flap.database.sql.SqlDB class, because it uses the vocabularies from the database.
from flap.parser.rule_parser_fast import RuleParserFast
from flap.database.sql import SqlDB
db_path = '<path_to_the_built_db>'
sql_db = SqlDB(db_path)
parser = RuleParserFast(sql_db)
address = '1345 GOVAN ROAD, GLASGOW G51 4TF'
parsed = parser.parse(address, method='all')
print(parsed)
Here, I demonstrate a simplified process of how an address is matched to the database.
We start with an input address
address = '1345 GOVAN ROAD, GLASGOW G51 4TF'
Step 1. Parse the address string and extract information that is useful for narrowing down the search
from flap.database.sql import SqlDB
from flap.parser.rule_parser_fast import RuleParserFast
db_path = '<path_to_the_built_db>'
sql_db = SqlDB(db_path)
parser = RuleParserFast(sql_db)
parsed = parser.parse(address)
postcode = parsed['FOR_QUERY']['POSTCODE']
print(postcode)
Step 2. Narrow down to the local area using an SQL query
local_area = sql_db.sql_query(f'select * from indexed where POSTCODE=="{postcode}"')
headers = sql_db.get_columns_of_table('indexed')
print(local_area)
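To show the narrowing step in isolation, here is a self-contained sketch using Python's built-in sqlite3 module with an in-memory table. Only the table name indexed and the UPRN and POSTCODE columns come from this walkthrough; the STREET column, the toy rows, and the postcode storage format are assumptions. It uses a bound parameter instead of string formatting, which avoids quoting problems in the postcode value.

```python
import sqlite3

# Toy stand-in for the built database; the real 'indexed' table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE indexed (UPRN TEXT, STREET TEXT, POSTCODE TEXT)")
conn.executemany(
    "INSERT INTO indexed VALUES (?, ?, ?)",
    [
        ("906700404351", "GOVAN ROAD", "G51 4TF"),
        ("906426044", "LITTLE FRANCE CRESCENT", "EH16 4TJ"),
    ],
)

postcode = "G51 4TF"
# Bind the postcode as a parameter rather than interpolating it into the SQL text.
cursor = conn.execute("SELECT * FROM indexed WHERE POSTCODE = ?", (postcode,))
headers = [column[0] for column in cursor.description]
local_area = [dict(zip(headers, row)) for row in cursor.fetchall()]
conn.close()

print(local_area)
```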
Step 3. Generate features for each address-UPRN pair
from flap.matcher.sql_matcher import (
    prepare_uprn, get_number_like_matching_matrix, postcode_matching, summarize_features
)
from flap.alignment.linear_assignment_alignment import LinearAssignmentAlignment
records = {}
for query_res in local_area:
    d_uprn = {k: v for k, v in zip(headers, query_res)}  # Convert the UPRN query result to a dict
    uprn_prepared = prepare_uprn(d_uprn)
    print(uprn_prepared)
    # Features based on the free-text part of the address and UPRN are generated using linear assignment
    # See https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html
    seq1 = parsed['TEXTUAL']
    seq2 = uprn_prepared['TEXTUAL']
    laa = LinearAssignmentAlignment(seq1, seq2)
    alignment_results = laa.get_result()
    text_align_features = alignment_results.get_score()
    # Features based on the deterministic part of the address and UPRN are generated by parsing and pairwise comparison
    sn1 = list(parsed['NUMBER_LIKE'].values())
    sn2 = uprn_prepared['NUMBER_LIKE']
    mat = get_number_like_matching_matrix(sn1, sn2)
    # Lastly, we need features from comparing the postcodes
    pc1 = parsed['POSTCODE_SPLIT']
    pc2 = uprn_prepared['POSTCODE_SPLIT']
    pc_match = postcode_matching(pc1, pc2)
    # Concatenate the features
    features = summarize_features(mat, text_align_features, pc_match)
    # Store the features, keyed by UPRN
    records[d_uprn['UPRN']] = features
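To give a feel for what aligning two token sequences by linear assignment means, the sketch below pairs tokens one-to-one so that the total similarity is maximal. It is not FLAP's LinearAssignmentAlignment: the similarity measure (difflib's SequenceMatcher ratio) and the brute-force search over permutations are stand-ins chosen to keep the example dependency-free, and the function names are hypothetical. For real inputs, scipy.optimize.linear_sum_assignment solves the same assignment problem efficiently.

```python
from difflib import SequenceMatcher
from itertools import permutations


def token_similarity(a, b):
    """Similarity in [0, 1] between two tokens (stand-in measure, not FLAP's)."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def best_assignment(seq1, seq2):
    """Return (total_similarity, pairs) for the best one-to-one pairing of tokens.

    Brute force over permutations, so only suitable for short token lists.
    """
    if len(seq1) <= len(seq2):
        short, long_, swapped = seq1, seq2, False
    else:
        short, long_, swapped = seq2, seq1, True
    best_score, best_pairs = -1.0, []
    for perm in permutations(range(len(long_)), len(short)):
        score = sum(token_similarity(short[i], long_[j]) for i, j in enumerate(perm))
        if score > best_score:
            pairs = [(short[i], long_[j]) for i, j in enumerate(perm)]
            if swapped:
                # Keep each pair ordered as (token from seq1, token from seq2).
                pairs = [(b, a) for a, b in pairs]
            best_score, best_pairs = score, pairs
    return best_score, best_pairs


score, pairs = best_assignment(["GOVAN", "RD", "GLASGOW"], ["GLASGOW", "GOVAN", "ROAD"])
print(score, pairs)
```

The point of the assignment formulation is that token order does not matter and abbreviations such as RD can still pair with ROAD, because each token is matched to its most compatible counterpart overall.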
Step 4. Score the features corresponding to each UPRN and get the UPRN with the highest score
import os
import flap
from flap.matcher.sql_matcher import ClassifierScorer
MODULE_PATH = os.path.dirname(flap.__file__)
# Pick the first bundled classifier file in the package's model directory
DEFAULT_MODEL_PATH = [os.path.join(MODULE_PATH, 'model', path)
                      for path in os.listdir(os.path.join(MODULE_PATH, 'model')) if 'clf' in path][0]
scorer = ClassifierScorer(DEFAULT_MODEL_PATH)
match_scores = {k: scorer.score(v) for k, v in records.items()}
best_match = max(match_scores, key=lambda k: match_scores[k])
print(best_match)