wikipedia-airport-scraper

Get airport codes and flight connections from Wikipedia airport pages

Version: 0.6.0 (latest)
Versions: 6
Maintainers: 1

Wikipedia Airport Scraper

A small Node.js script that scrapes information about airports and their destinations from Wikipedia pages. Given the full HTML of any airport page on the mobile version of the English-language Wikipedia, it extracts:

  • IATA and ICAO codes for this airport
  • Coordinates
  • A list of all flights, each with the destination airport, the airline and any start and end dates. Flags indicate whether a destination has been suspended, whether it's seasonal and/or whether it's operated as a charter flight. In short: everything from the 'Airlines and destinations' table, in a consistent format.

It is left to any script that uses this to:

  • Set up requests to the en.m.wikipedia.org pages, grab responses and rate limit those requests where necessary
  • Store and process any output from the scraper
  • Make further requests to look up destination airports or link airline names to IATA/ICAO codes
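
As a rough illustration of the first two responsibilities, here is a minimal sketch of a sequential, rate-limited loop. The names `toMobileUrl`, `sleep` and `scrapeAirports`, the `delayMs` option and the one-second default are all hypothetical and not part of this package; the HTTP client and the scraper are passed in as functions so that any library can be used.

```javascript
// Hypothetical helper: build the mobile page URL for a Wikipedia page title.
const toMobileUrl = (title) =>
  `https://en.m.wikipedia.org/wiki/${encodeURIComponent(title)}`

// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Fetch and scrape pages one at a time, pausing between requests.
// `fetchHtml(url)` must resolve to the page HTML; `scrape(html)` is the scraper.
async function scrapeAirports(titles, { fetchHtml, scrape, delayMs = 1000 }) {
  const results = {}
  for (const [index, title] of titles.entries()) {
    if (index > 0) await sleep(delayMs) // crude rate limit: one request per delay
    const html = await fetchHtml(toMobileUrl(title))
    results[title] = await scrape(html) // await works for sync and async scrapers
  }
  return results
}
```

With `got`, `fetchHtml` could be `(url) => got(url).text()`. Running requests strictly one after another is the simplest way to stay polite; a queue with limited concurrency would be the next step if that turns out to be too slow.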

Right now, this script doesn't provide any way to look up the basic data found on airline pages, so it can't help you link airline names to codes. Such functionality might be added in the future.

How to use

Here's a very simple example that gets data for Brussels Airport from its Wikipedia URL:

import { fileURLToPath } from 'node:url'

import got from 'got' // Or any other package that requests an HTML page
import write from 'write' // Or any other package that writes data to a local file

import scrape from 'wikipedia-airport-scraper'

// Get the HTML from the page and pass it to the script
const data = await got('https://en.m.wikipedia.org/wiki/Brussels_Airport').then((response) => scrape(response.body))

// Write out the scraped data next to this file
// (fileURLToPath also handles Windows paths and percent-encoded characters)
const outputPath = fileURLToPath(new URL('./data.json', import.meta.url))
await write(outputPath, JSON.stringify(data, null, 2))

The data (simplified to show only one airline and one destination) then looks like this:

{
  "name": "Brussels Airport",
  "iataCode": "BRU",
  "icaoCode": "EBBR",
  "coordinates": {
    "latitude": 50.901389,
    "longitude": 4.484444
  },
  "flights": [
    {
      "airline": {
        "name": "Aegean Airlines",
        "link": "Aegean_Airlines"
      },
      "destination": {
        "shortName": "Athens",
        "fullName": "Athens International Airport",
        "link": "Athens_International_Airport",
        "isCharter": false,
        "isSeasonal": false,
        "suspended": false,
        "startDate": null,
        "endDate": null
      }
    }
  ]
}
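
To give an idea of how this output might be processed, here is a small sketch that groups the destinations that are not suspended by airline name. The field names come from the sample output above; the function name `destinationsByAirline` is hypothetical and not part of this package.

```javascript
// Group non-suspended destinations by airline name.
function destinationsByAirline(data) {
  const byAirline = {}
  for (const { airline, destination } of data.flights) {
    if (destination.suspended) continue // skip suspended routes
    byAirline[airline.name] ??= []
    byAirline[airline.name].push(destination.shortName)
  }
  return byAirline
}

// Trimmed-down version of the Brussels Airport sample
const sample = {
  name: 'Brussels Airport',
  flights: [
    {
      airline: { name: 'Aegean Airlines', link: 'Aegean_Airlines' },
      destination: { shortName: 'Athens', link: 'Athens_International_Airport', suspended: false },
    },
  ],
}

console.log(destinationsByAirline(sample)) // { 'Aegean Airlines': [ 'Athens' ] }
```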

Caveats

  • It's obvious but probably deserves to be said: the output of this script can only be as good as the Wikipedia page that it uses as input. YMMV.

  • Two different but related airlines might be mapped to the same link (and ultimately IATA code) by Wikipedia. Here's part of the output from Kansai International Airport that shows All Nippon Airways and ANA Wings with a different name but the same link, in this case serving the same route. For now, the script will not recognise this as a duplicate.

{
  "flights": [
    {
      "airline": {
        "name": "All Nippon Airways",
        "link": "All_Nippon_Airways"
      },
      "destination": {
        "shortName": "Naha",
        "fullName": "Naha Airport",
        "link": "Naha_Airport",
        "isCharter": false,
        "isSeasonal": false,
        "suspended": false,
        "startDate": null,
        "endDate": null
      }
    },
    {
      "airline": {
        "name": "ANA Wings",
        "link": "All_Nippon_Airways"
      },
      "destination": {
        "shortName": "Naha",
        "fullName": "Naha Airport",
        "link": "Naha_Airport",
        "isCharter": false,
        "isSeasonal": false,
        "suspended": false,
        "startDate": null,
        "endDate": null
      }
    }
  ]
}
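
If such duplicates matter for your use case, they can be filtered out after scraping. The following is only a sketch: `dedupeFlights` is a hypothetical helper, not part of this package, and it assumes that two flights count as the same route when both the airline link and the destination link match. Falling back to the names when a link is `null` is an assumption as well.

```javascript
// Keep only the first flight for each (airline link, destination link) pair.
function dedupeFlights(flights) {
  const seen = new Set()
  return flights.filter(({ airline, destination }) => {
    // Assumption: fall back to the display names when a link is missing
    const airlineKey = airline.link ?? airline.name
    const destinationKey = destination.link ?? destination.shortName
    const key = `${airlineKey}|${destinationKey}`
    if (seen.has(key)) return false
    seen.add(key)
    return true
  })
}
```

Applied to the Kansai output above, this keeps the All Nippon Airways entry and drops the ANA Wings one. Whether that is desirable depends on whether you care about the operating carrier.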

  • Not every destination airport that the script picks up on airport pages has an actual Wikipedia page. If the link leads to a Wikipedia edit page, it appears in the JSON as null. Here's part of the output from Ignatyevo Airport, which shows Zeya as a destination airport with no Wikipedia page to link to:

{
  "flights": [
    {
      "airline": {
        "name": "Angara Airlines",
        "link": "Angara_Airlines"
      },
      "destination": {
        "shortName": "Zeya",
        "fullName": null,
        "link": null,
        "isCharter": false,
        "isSeasonal": true,
        "suspended": false,
        "startDate": null,
        "endDate": null
      }
    }
  ]
}
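
When following destination links to scrape further airports, those null links need to be skipped. A minimal sketch, assuming that a non-null `link` value is the page title that goes after `/wiki/` in the URL (the sample output suggests this, but it is an assumption); `splitDestinations` is a hypothetical helper, not part of this package.

```javascript
// Separate destinations that can be followed up from those without a page.
function splitDestinations(flights) {
  const linked = new Set() // URLs of destination pages, without duplicates
  const unlinked = new Set() // short names of destinations without a page
  for (const { destination } of flights) {
    if (destination.link === null) {
      unlinked.add(destination.shortName)
    } else {
      // Assumption: the link is the page title used in the mobile URL
      linked.add(`https://en.m.wikipedia.org/wiki/${destination.link}`)
    }
  }
  return { linked: [...linked], unlinked: [...unlinked] }
}
```

Keeping the unlinked names around, rather than dropping them silently, makes it easier to see which destinations are missing from any follow-up data.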


Package last updated on 11 Dec 2024
