# NAME

Flat::Schema - Deterministic schema contracts for flat files

# WHY THIS EXISTS (IN ONE PARAGRAPH)

In real ETL work, yesterday's CSV becomes today's "contract" whether you meant it or not. Flat::Schema makes that contract explicit: generate a deterministic schema from what you observed, record ambiguity as issues, and give the next step (validation) something stable to enforce.

# SYNOPSIS

Basic usage:

    use Flat::Profile;
    use Flat::Schema;

    my $profile = Flat::Profile->profile_file(
        file => "data.csv",
    );

    my $schema = Flat::Schema->from_profile(
        profile => $profile,
    );

    print Flat::Schema->new()->to_json(schema => $schema);

With overrides:

    my $schema = Flat::Schema->from_profile(
        profile   => $profile,
        overrides => [
            { column_index => 0, set => { type => 'integer', nullable => 0 } },
            { column_index => 3, set => { name => 'created_at', type => 'datetime' } },
        ],
    );

# DESCRIPTION

Flat::Schema consumes reports produced by [Flat::Profile](https://metacpan.org/pod/Flat%3A%3AProfile) and generates a deterministic, inspectable schema contract describing what tabular data **should** look like.

It is the second module in the Flat::\* series:

- Flat::Profile - What the data looks like
- Flat::Schema - What the data should look like
- Flat::Validate - Does the data conform (planned)

The schema is a canonical Perl data structure that:

- Is stable and deterministic (identical inputs → identical output)
- Is serializable to JSON and YAML
- Captures inference decisions and ambiguity as issues
- Can be consumed by Flat::Validate or other tooling

# REAL-WORLD USE CASES (THE STUFF YOU ACTUALLY DO)

## 1) Vendor “helpfully” changes a column (integer → text)

You ingest daily files and one day a numeric column starts containing values like `N/A`, `unknown`, or `ERR-17`. Your pipeline should not silently coerce this into zero or drop rows.

Workflow:

1. Profile last-known-good
2. Generate schema (your contract)
3. Validate future drops against the schema

A typical override when you decide "we accept this as string now":

    my $schema = Flat::Schema->from_profile(
        profile   => $profile,
        overrides => [
            { column_index => 7, set => { type => 'string' } },
        ],
    );

Flat::Schema will record that the override conflicts with what it inferred, and that record is useful during incident review.

## 2) Columns that are “nullable in real life” even if today they are not

Data often arrives complete in a sample window and then starts missing values in production. In v1, nullability is intentionally simple:

    nullable = true iff null_count > 0

If you know a field is nullable even if today it isn't, force it:

    overrides => [
        { column_index => 2, set => { nullable => 1 } },  # allow missing later
    ],

## 3) Timestamp confusion: date vs datetime vs “whatever the exporter did”

When temporal evidence mixes, Flat::Schema chooses predictability over cleverness.

- date + datetime → datetime
- temporal + non-temporal → string (and it tells you)

This prevents “maybe parseable” data from becoming quietly wrong later.

## 4) “Header row roulette” and naming cleanup

You may get headers like `Customer ID`, `customer_id`, `CUSTID`, or no header at all. Schema stores both:

- `index` always
- `name` when available

If you need normalized naming for downstream systems:

    overrides => [
        { column_index => 0, set => { name => 'customer_id' } },
    ],

## 5) Reproducible artifacts for tickets, audits, and “what changed?”

Sometimes the most important feature is being able to paste the schema into a ticket, diff it in Git, or keep it as a build artifact.

Flat::Schema’s serializers are deterministic by design. If the schema changes, it is because the inputs changed (profile or overrides), not because hash order shifted.

# SCHEMA STRUCTURE (AT A GLANCE)

A generated schema contains:

    {
        schema_version => 1,
        generator      => { name => "Flat::Schema", version => "0.01" },
        profile        => { ... },
        columns        => [ ... ],
        issues         => [ ... ],
    }

Each column contains:

    {
        index      => 0,
        name       => "id",
        type       => "integer",
        nullable   => 0,
        length     => { min => 1, max => 12 },   # optional
        overrides  => { ... },                   # optional
        provenance => {
            basis         => "profile",
            rows_observed => 1000,
            null_count    => 0,
            null_rate     => { num => 0, den => 1000 },
            overrides     => [ "type", "nullable" ],   # optional
        },
    }

# TYPE INFERENCE (v1)

Type inference is based solely on evidence provided by Flat::Profile.

Scalar widening order:

    boolean → integer → number → string

Temporal handling:

    date + datetime → datetime
    temporal + non-temporal → string (with warning)

Mixed evidence is widened and recorded as an issue.

# NULLABILITY INFERENCE (v1)

Rules:

- nullable = true iff null\_count > 0
- If rows\_profiled == 0, all columns are nullable
- All-null columns emit warning `all_null_column`
- Zero profiled rows emits warning `no_rows_profiled`

# USER OVERRIDES (v1)

Overrides are applied after inference.

Supported fields:

- type
- nullable
- name
- length (min/max)

Overrides:

- Are index-based (column\_index required)
- May conflict with inferred values (recorded as warnings)
- Are recorded in column.overrides
- Are recorded in provenance.overrides
- Emit an informational `override_applied` issue

Overrides referencing unknown columns cause a hard error.

# DETERMINISTIC SERIALIZATION

Flat::Schema includes built-in deterministic JSON and YAML serializers.

Same input profile + same overrides → identical JSON/YAML.

This is required for reproducible pipelines and meaningful diffs.

# STATUS

Implemented in v1:

- Canonical schema structure
- Deterministic serialization
- Type inference
- Nullability inference
- User overrides (index-based)

Future releases may expand the type lattice, constraint modeling, and schema evolution.

# AUTHOR

Sergio de Sousa

# LICENSE

This library is free software; you may redistribute it and/or modify it under the same terms as Perl itself.
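
# APPENDIX: TYPE WIDENING SKETCH

The v1 widening rules (boolean → integer → number → string, date + datetime → datetime, temporal + non-temporal → string) can be illustrated with a small standalone sketch. This is not Flat::Schema's implementation: `widen_pair` and `widen_all` are hypothetical helpers written for this example, the issue messages are invented, and the sketch assumes that every case of differing evidence produces one recorded issue.

```perl
use strict;
use warnings;

# Hypothetical lookup tables for this sketch (not Flat::Schema internals).
my %SCALAR_RANK = (boolean => 0, integer => 1, number => 2, string => 3);
my %IS_TEMPORAL = (date => 1, datetime => 1);

# Merge two type names. Returns (merged_type, issue_message_or_undef).
sub widen_pair {
    my ($left, $right) = @_;

    for my $type ($left, $right) {
        die "unknown type '$type'\n"
            unless $IS_TEMPORAL{$type} || exists $SCALAR_RANK{$type};
    }

    # Identical evidence: nothing to widen, nothing to record.
    return ($left, undef) if $left eq $right;

    my $left_temporal  = $IS_TEMPORAL{$left}  ? 1 : 0;
    my $right_temporal = $IS_TEMPORAL{$right} ? 1 : 0;

    # date + datetime -> datetime
    return ('datetime', "mixed temporal evidence: $left + $right")
        if $left_temporal && $right_temporal;

    # temporal + non-temporal -> string
    return ('string', "temporal mixed with non-temporal: $left + $right")
        if $left_temporal != $right_temporal;

    # Both scalar: keep whichever sits later in the widening order.
    my $wider = $SCALAR_RANK{$left} >= $SCALAR_RANK{$right} ? $left : $right;
    return ($wider, "mixed scalar evidence: $left + $right");
}

# Fold a list of observed types into one type plus the issues raised.
sub widen_all {
    my @types = @_;
    die "need at least one type\n" unless @types;

    my $current = shift @types;
    my @issues;
    for my $next (@types) {
        my ($merged, $issue) = widen_pair($current, $next);
        push @issues, $issue if defined $issue;
        $current = $merged;
    }
    return ($current, \@issues);
}

my ($scalar_type)   = widen_all(qw(boolean integer number));
my ($temporal_type) = widen_all(qw(date datetime));
my ($mixed_type, $mixed_issues) = widen_all(qw(date integer));

print "$scalar_type\n";      # number
print "$temporal_type\n";    # datetime
print "$mixed_type\n";       # string
print scalar(@$mixed_issues), "\n";   # 1
```

Because `string` absorbs every other type, the merged type does not depend on the order in which evidence is folded; only the order of the recorded messages does.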