Version: 1.0 prerelease

Try Great Expectations

Start here to learn how to connect to sample data, build an Expectation, validate sample data, and review Validation Results. This is an ideal place to start if you're new to GX 1.0 and want to experiment with its features and see what it offers.

Prerequisites

Setup

GX 1.0 is a Python library you can install with the Python pip tool.

For more comprehensive guidance on setting up a Python environment, installing GX 1.0, and installing additional dependencies for specific data formats and storage environments, see Set up a GX environment.

  1. Run the following terminal command to install the GX 1.0 library:

    Terminal input
    pip install great_expectations
  2. Verify GX 1.0 installed successfully:

    Terminal input
    great_expectations --version

    The following output appears when GX 1.0 is successfully installed:

    Terminal output
    great_expectations, version 1.0.0a4
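
If you would rather confirm the installation from inside Python, for example in a notebook, the following minimal sketch reads the installed package version through the standard library. The helper name `installed_gx_version` is hypothetical and is not part of GX; the sketch assumes the distribution is registered under the name used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_gx_version(package: str = "great_expectations") -> Optional[str]:
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        # Reads the version from the installed package metadata.
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_gx_version()
    if found is None:
        print("great_expectations is not installed in this environment")
    else:
        print(f"great_expectations, version {found}")
```

A return value of `None` usually means that pip installed the library into a different Python environment than the one running the check.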

Test features and functionality

  1. Import the great_expectations library and expectations module.

    The great_expectations module is the root of the GX library and contains shortcuts and convenience methods for starting a GX project in a Python session.

    The expectations module contains all the Expectation classes that are provided by the GX library.

    Run the following code in a Python interpreter, IDE, or script:

    Python input
    import great_expectations as gx
    import great_expectations.expectations as gxe
  2. Create a temporary Data Context and connect to sample data.

    In Python, a Data Context provides the API for interacting with many common GX objects.

    Run the following code to initialize a Data Context and then use it to read the contents of a .csv file into a Batch of sample data:

    Python input
    context = gx.get_context()
    batch = context.data_sources.pandas_default.read_csv(
        "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
    )

    You'll use this sample data to test your Expectations.

  3. Create an Expectation.

    Expectations are a fundamental component of GX. They allow you to explicitly define the state to which your data should conform.

    The sample data you're using is taxi trip record data. With this data, you can make certain assumptions. For example, the passenger count shouldn't be zero because at least one passenger needs to be present. Additionally, a taxi can accommodate a maximum of six passengers.

    Run the following code to define an Expectation that the contents of the column passenger_count consist of values ranging from 1 to 6:

    Python input
    expectation = gxe.ExpectColumnValuesToBeBetween(
        column="passenger_count", min_value=1, max_value=6
    )
  4. Run the following code to validate the sample data against your Expectation and view the results:

    Python input
    validation_result = batch.validate(expectation)
    print(validation_result.describe())

    The sample data conforms to the defined Expectation and the following Validation Results are returned:

    Python output
    {
        "type": "expect_column_values_to_be_between",
        "success": true,
        "kwargs": {
            "batch_id": "default_pandas_datasource-#ephemeral_pandas_asset",
            "column": "passenger_count",
            "min_value": 1.0,
            "max_value": 6.0
        },
        "result": {
            "element_count": 10000,
            "unexpected_count": 0,
            "unexpected_percent": 0.0,
            "partial_unexpected_list": [],
            "missing_count": 0,
            "missing_percent": 0.0,
            "unexpected_percent_total": 0.0,
            "unexpected_percent_nonmissing": 0.0,
            "partial_unexpected_counts": [],
            "partial_unexpected_index_list": []
        }
    }
  5. Optional. Create an Expectation that will fail when validated against the provided data.

    A failed Expectation indicates either a problem with the data, such as missing or incorrect values, or a misunderstanding about the data.

    Run the following code to create an Expectation that fails because it assumes that a taxi can seat a maximum of three passengers:

    Python input
    failed_expectation = gxe.ExpectColumnValuesToBeBetween(
        column="passenger_count", min_value=1, max_value=3
    )
    failed_validation_result = batch.validate(failed_expectation)
    print(failed_validation_result.describe())

    When an Expectation fails, the Validation Results of the failed Expectation include metrics to help you assess the severity of the issue:

    Python output
    {
    "type": "expect_column_values_to_be_between",
    "success": false,
    "kwargs": {
    "batch_id": "default_pandas_datasource-#ephemeral_pandas_asset",
    "column": "passenger_count",
    "min_value": 1.0,
    "max_value": 3.0
    },
    "result": {
    "element_count": 10000,
    "unexpected_count": 853,
    "unexpected_percent": 8.53,
    "partial_unexpected_list": [
    4,
    4,
    4,
    4,
    4,
    4,
    4,
    4,
    4,
    4,
    4,
    4,
    4,
    4,
    4,
    4,
    4,
    4,
    4,
    4
    ],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 8.53,
    "unexpected_percent_nonmissing": 8.53,
    "partial_unexpected_counts": [
    {
    "value": 4,
    "count": 20
    }
    ],
    "partial_unexpected_index_list": [
    9147,
    9148,
    9149,
    9150,
    9151,
    9152,
    9153,
    9154,
    9155,
    9156,
    9157,
    9158,
    9159,
    9160,
    9161,
    9162,
    9163,
    9164,
    9165,
    9166
    ]
    }
    }

    To keep the report compact and easier to review, the Validation Results include only a portion of the failed values and record indexes. The unexpected counts and percentages, however, cover every failed record in the validated data.

  6. Optional. Go to the Expectations Gallery and experiment with other Expectations.
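
To make the result fields in the Validation Results above easier to interpret, here is a simplified pure-Python sketch of a range check that produces a few of the same metrics. It is not how GX implements the Expectation. The helper `summarize_range_check` is hypothetical, the inclusive bounds and the cap of 20 reported failures are assumptions inferred from the output shown in this guide, and missing values are ignored.

```python
import json
from typing import Any, Dict, Sequence


def summarize_range_check(
    values: Sequence[float],
    min_value: float,
    max_value: float,
    partial_limit: int = 20,
) -> Dict[str, Any]:
    """Check that every value lies in [min_value, max_value] and summarize failures.

    Hypothetical helper for illustration only; it mirrors a few of the result
    fields shown in this guide and does not handle missing values.
    """
    # Keep (index, value) pairs for values outside the inclusive range.
    unexpected = [
        (index, value)
        for index, value in enumerate(values)
        if not (min_value <= value <= max_value)
    ]
    element_count = len(values)
    unexpected_count = len(unexpected)
    unexpected_percent = (
        100.0 * unexpected_count / element_count if element_count else 0.0
    )
    return {
        "success": unexpected_count == 0,
        "element_count": element_count,
        "unexpected_count": unexpected_count,
        "unexpected_percent": unexpected_percent,
        # Only a capped sample of failures is listed; the counts cover all of them.
        "partial_unexpected_list": [v for _, v in unexpected[:partial_limit]],
        "partial_unexpected_index_list": [i for i, _ in unexpected[:partial_limit]],
    }


if __name__ == "__main__":
    # Made-up passenger counts, not the tutorial's sample data.
    passenger_counts = [1, 2, 4, 1, 6, 3, 5, 2]
    print(json.dumps(summarize_range_check(passenger_counts, 1, 6), indent=4))
    print(json.dumps(summarize_range_check(passenger_counts, 1, 3), indent=4))
```

With the made-up list, the 1 to 6 range passes, while the 1 to 3 range fails for three of the eight values. This is the same relationship between `unexpected_count` and `unexpected_percent` that appears in the failed Validation Results above.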

Next steps