# InstDataShipper

This gem facilitates easy upload of LTI datasets to Instructure Hosted Data.
## Installation

Add this line to your application's Gemfile:

```ruby
gem 'inst_data_shipper'
```

Then run the migrations:

```bash
bundle exec rake db:migrate
```
## Usage

### Dumper
The main tool provided by this Gem is the `InstDataShipper::Dumper` class. It is used to define a "Dump", which is a combination of tasks and schema.

It is assumed that a `Dumper` class definition is the source of truth for all tables that it manages, and that no other processes affect the tables' data or schema. You can break this assumption, but you should understand how the incremental logic works and what will and will not trigger a full table upload.

Dumpers have an `export_genre` method that determines which Dumps to look at when calculating incrementals.

- At a high level, the Hosted Data backend will look for a past dump of the same genre. If none is found, a full upload of all tables is triggered. If one is found, each table's schema is compared; any table with a mismatched schema (determined by hashing) will do a full upload.
- Note that `Proc`s in the schema are not included in the hash calculation. If you change a `Proc`'s implementation and need to trigger a full upload of the table, you'll need to change something else too (like the `version`).
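To illustrate why changing only a `Proc` does not trigger a full upload, here is a small standalone sketch of a schema hash that ignores `Proc` values. This is not the gem's actual hashing code, and the column layout is illustrative:

```ruby
require 'digest'

# Hypothetical sketch: hash a table schema while skipping Proc values,
# mirroring the incremental-detection behavior described above.
def schema_hash(columns)
  hashable = columns.map { |col| col.reject { |_key, value| value.is_a?(Proc) } }
  Digest::SHA256.hexdigest(hashable.inspect)
end

# Two versions of the same column, differing only in the Proc body:
v1 = [{ name: :sis_type, type: "varchar(32)", from: ->(rec) { rec.sis_source_type } }]
v2 = [{ name: :sis_type, type: "varchar(32)", from: ->(rec) { rec.sis_source_type&.upcase } }]

schema_hash(v1) == schema_hash(v2) # => true: the Proc change alone is invisible to the hash
```

Changing the column type (or, in a real schema, the `version`) would change the hash and force a full upload.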
Here is an example `Dumper` implementation, wrapped in an ActiveJob job:
```ruby
class HostedDataPushJob < ApplicationJob
  SCHEMA = InstDataShipper::SchemaBuilder.build do
    # The table builder can be extended with custom helper methods:
    extend_table_builder do
      def custom_column(*args, from: nil, **kwargs, &blk)
        from ||= args[0].to_s
        from = ->(row) { row.data[from] } if from.is_a?(String)
        column(*args, **kwargs, from: from, &blk)
      end

      # extend_table_builder may also include shared concerns:
      include SomeConcern
    end

    table(ALocalModel, "<TABLE DESCRIPTION>") do
      # Enable incremental updates for this table:
      incremental "updated_at", on: [:id], if: ->() {}

      # Declare the data source(s) that may feed this table:
      source :local_table
      source ->(table_def) { import_local_table(table_def[:model] || table_def[:warehouse_name]) }

      # Bump the version to force a full upload (e.g. after changing a Proc):
      version "1.0.0"

      column :name_in_destinations, :maybe_optional_sql_type, "Optional description of column"

      # `from:` defaults to the column name:
      column :name, :"varchar(128)"

      # `from:` may be a Symbol (a method on the record)...
      column :sis_type, :"varchar(32)", from: :some_model_method
      # ...a String...
      column :sis_type, :"varchar(32)", from: "sis_source_type"
      # ...or a Proc:
      column :sis_type, :"varchar(32)", from: ->(rec) { ... }
      column :sis_type, :"varchar(32)"
    end

    # The warehouse table name may be given explicitly, with the model as an option:
    table("my_table", model: ALocalModel) do
    end

    # Tables backed by a Canvas report map columns from the report's headers:
    table("proserv_student_submissions_csv") do
      column :canvas_id, :bigint, from: "canvas user id"
      column :sis_id, :"varchar(64)", from: "sis user id"
      column :name, :"varchar(64)", from: "user name"
      column :submission_id, :bigint, from: "submission id"
    end
  end

  Dumper = InstDataShipper::Dumper.define(schema: SCHEMA, include: [
    InstDataShipper::DataSources::LocalTables,
    InstDataShipper::DataSources::CanvasReports,
  ]) do
    import_local_table(ALocalModel)
    import_canvas_report_by_terms("proserv_student_submissions_csv", terms: Term.all.pluck(:canvas_id))

    # Tasks may target a differently named schema table with `schema_name:`:
    import_local_table(SomeModel, schema_name: "my_table")
    import_canvas_report_by_terms("some_report", terms: Term.all.pluck(:canvas_id), schema_name: "my_table")

    # Or enqueue tasks automatically from the schema's `source` declarations:
    auto_enqueue_from_schema
  end

  def perform
    Dumper.perform_dump([
      "hosted-data://<JWT>@<HOSTED DATA SERVER>?table_prefix=example",
      "s3://<access_key_id>:<access_key_secret>@<region>/<bucket>/<path>",
    ])
  end
end
```
`Dumper`s may also be defined as a normal Ruby subclass:
```ruby
class HostedDataPushJob < ApplicationJob
  SCHEMA = InstDataShipper::SchemaBuilder.build do
    # ...
  end

  class Dumper < InstDataShipper::Dumper
    include InstDataShipper::DataSources::LocalTables
    include InstDataShipper::DataSources::CanvasReports

    def enqueue_tasks
      import_local_table(ALocalModel)
      import_canvas_report_by_terms("proserv_student_submissions_csv", terms: Term.all.pluck(:canvas_id))
    end

    def table_schemas
      SCHEMA
    end
  end

  def perform
    Dumper.perform_dump([
      "hosted-data://<JWT>@<HOSTED DATA SERVER>?table_prefix=example",
      "s3://<access_key_id>:<access_key_secret>@<region>/<bucket>/<path>",
    ])
  end
end
```
## Destinations

This Gem is mainly designed for use with Hosted Data, but it abstracts that a little to allow for other destinations/backends. Out of the box, support for Hosted Data and S3 is included.

Destinations are passed as URI-formatted strings. Passing Hashes is also supported, but the format/keys are destination-specific.

Destinations blindly accept URI Fragments (the `#` chunk at the end of the URI). These options are not used internally but are made available as `dest.user_config`. Ideally these are in the same format as query parameters (`x=1&y=2`, which it will try to parse into a Hash), but they can be any string.
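As a sketch of that convention (using only Ruby's standard library, not the gem's own parsing), a fragment written in query-parameter form decodes like this; the URI below is hypothetical:

```ruby
require 'uri'

# Hypothetical destination URI with a fragment carrying user config:
dest = URI.parse("hosted-data://JWT@hosted.example.com?table_prefix=example#x=1&y=2")

dest.fragment                           # => "x=1&y=2"
URI.decode_www_form(dest.fragment).to_h # => {"x"=>"1", "y"=>"2"}
```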
### Hosted Data

```
hosted-data://<JWT>@<HOSTED DATA SERVER>
```

Optional Parameters:

- `table_prefix`: An optional string to prefix onto each table name in the schema when declaring the schema in Hosted Data
### S3

```
s3://<access_key_id>:<access_key_secret>@<region>/<bucket>/<optional path>
```

Optional Parameters:

- None
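One general caveat about URI-formatted credentials (an assumption about URI syntax, not documented gem behavior): secret keys can contain reserved characters such as `/` or `+`, so percent-encoding the secret keeps the destination string parseable. The credentials below are made up:

```ruby
require 'uri'

# Hypothetical credentials; real AWS secret keys can contain "/" and "+".
access_key_id = "AKIAEXAMPLE"
secret        = "abc/def+ghi"

dest = "s3://#{access_key_id}:#{URI.encode_www_form_component(secret)}@us-east-1/my-bucket/dumps"

uri = URI.parse(dest)
URI.decode_www_form_component(uri.password) # => "abc/def+ghi"
```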
## Development

When adding to or updating this gem, make sure you do the following:

- Update the yardoc comments where necessary, and confirm the changes by running `yardoc --server`
- Write specs
- If you modify the model or migration templates, run `bundle exec rake update_test_schema` to update them in the Rails Dummy application (and commit those changes)
## Docs

Docs can be generated using yard. To view the docs:

- Clone this gem's repository
- Run `bundle install`
- Run `yard server --reload`

The yard server will give you a URL you can visit to view the docs.