Skip to main content
Version: 0.18.17

Get started with Great Expectations and SQL

Use the information provided here to learn how you can use Great Expectations (GX) with a SQL Data Source. The following examples use a PostgreSQL Database.

To use GX with PostgreSQL Database, you'll complete the following tasks:

  • Load data
  • Instantiate a Data ContextThe primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components.
  • Create a Data SourceProvides a standard API for accessing and interacting with data from a wide variety of source systems. and a Data AssetA collection of records within a Data Source which is usually named based on the underlying data system and sliced to correspond to a desired specification.
  • Create an Expectation SuiteA collection of verifiable assertions about data.
  • Validate data using a CheckpointThe primary means for validating data in a production deployment of Great Expectations.

The full code used in the following examples is available on GitHub:

Prerequisites

  • A working PostgreSQL Database
  • A working Python environment

Install GX

  1. Run the following command to install GX in your Python environment:
pip install great-expectations
  1. Run the following command to import configuration information that you'll use in the following steps:
Python
import great_expectations as gx
from great_expectations.checkpoint import Checkpoint

Set up GX

To avoid configuring external resources, you'll use your local filesystem for your Metadata Stores and Data DocsHuman readable documentation generated from Great Expectations metadata detailing Expectations, Validation Results, etc. store.

Run the following code to create a Data ContextThe primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components. with the default settings:

Python
context = gx.get_context()

Connect to your data

  1. Use a connection_string to securely connect to your PostgreSQL instance. For example
PG_CONNECTION_STRING = "postgresql+psycopg2://postgres:@localhost/taxi_db"

Replace the connection string with the connection string for your database. For additional information about other connection methods, see How to configure credentials. In this example, existing New York City taxi cab data is being used.

  1. Run the following command to create a Data SourceProvides a standard API for accessing and interacting with data from a wide variety of source systems. to represent the data available in your PostgreSQL database:
Python
pg_datasource = context.sources.add_postgres(
name="pg_datasource", connection_string=PG_CONNECTION_STRING
)
  1. Run the following command to create a Data AssetA collection of records within a Data Source which is usually named based on the underlying data system and sliced to correspond to a desired specification. to represent a discrete set of data:
Python
pg_datasource.add_table_asset(
name="postgres_taxi_data", table_name="postgres_taxi_data"
)

In this example, the name of a specific table within your database is used.

  1. Run the following command to build a Batch RequestProvided to a Data Source in order to create a Batch. using the Data AssetA collection of records within a Data Source which is usually named based on the underlying data system and sliced to correspond to a desired specification. you configured previously:
Python
batch_request = pg_datasource.get_asset("postgres_taxi_data").build_batch_request()

Create Expectations

You'll use a ValidatorUsed to run an Expectation Suite against data. to interact with your batch of data and generate an Expectation SuiteA collection of verifiable assertions about data..

Every time you evaluate an Expectation with validator.expect_*, it is immediately Validated against your data. This instant feedback helps you identify unexpected data and removes the guesswork from data exploration. The Expectation configuration is stored in the Validator. When you are finished running the Expectations on the dataset, you can use validator.save_expectation_suite() to save all of your Expectation configurations into an Expectation Suite for later use in a checkpoint.

  1. Run the following command to create the suite and get a Validator:
Python
expectation_suite_name = "insert_your_expectation_suite_name_here"
context.add_or_update_expectation_suite(expectation_suite_name=expectation_suite_name)
validator = context.get_validator(
batch_request=batch_request,
expectation_suite_name=expectation_suite_name,
)

print(validator.head())
  1. Run the following command to use the Validator to add a few Expectations:
Python
validator.expect_column_values_to_not_be_null(column="passenger_count")

validator.expect_column_values_to_be_between(
column="congestion_surcharge", min_value=0, max_value=1000
)
  1. Run the following command to save your Expectation Suite (all the unique Expectation Configurations from each run of validator.expect_*) to your Expectation Store:
Python
validator.save_expectation_suite(discard_failed_expectations=False)

Validate your data

You'll create and store a CheckpointThe primary means for validating data in a production deployment of Great Expectations. for your batch, which you can use to validate and run post-validation actions.

  1. Run the following command to create the Checkpoint configuration that uses your Data Context:
Python
my_checkpoint_name = "my_sql_checkpoint"

checkpoint = Checkpoint(
name=my_checkpoint_name,
run_name_template="%Y%m%d-%H%M%S-my-run-name-template",
data_context=context,
batch_request=batch_request,
expectation_suite_name=expectation_suite_name,
action_list=[
{
"name": "store_validation_result",
"action": {"class_name": "StoreValidationResultAction"},
},
{"name": "update_data_docs", "action": {"class_name": "UpdateDataDocsAction"}},
],
)
  1. Run the following command to save the Checkpoint:
Python
context.add_or_update_checkpoint(checkpoint=checkpoint)
  1. Run the following command to run the Checkpoint and pass in your Batch Request (your data) and your Expectation Suite (your tests):
Python
checkpoint_result = checkpoint.run()

Your Checkpoint configuration includes the store_validation_result and update_data_docs actions. The store_validation_result action saves your validation results from the Checkpoint run and allows the results to be persisted for future use. The update_data_docs action builds Data Docs files for the validations run in the Checkpoint.

To learn more about Data validation and customizing Checkpoints, see Validate Data: Overview .

To view the full Checkpoint configuration, run print(checkpoint.get_config().to_yaml_str()).

Build and view Data Docs

Your Checkpoint contained an UpdateDataDocsAction, so your Data DocsHuman readable documentation generated from Great Expectations metadata detailing Expectations, Validation Results, etc. have already been built from the validation you ran and your Data Docs store contains a new rendered validation result.

Run the following command to open your Data Docs and review the results of your Checkpoint run:

context.open_data_docs()

Next steps

Now that you've created and saved a Data Context, Data Source, Data Asset, Expectation Suite, and Checkpoint, see Validate data with Expectations and Checkpoints to create a script to run the Checkpoint without the need to recreate your Data Assets and Expectations.