Try Great Expectations
Start here to learn how to connect to sample data, build an Expectation, validate sample data, and review Validation Results. This is an ideal place to start if you're new to GX 1.0 and want to experiment with features and see what it offers.
Prerequisites
Setup
GX 1.0 is a Python library you can install with the Python pip tool.
For more comprehensive guidance on setting up a Python environment, installing GX 1.0, and installing additional dependencies for specific data formats and storage environments, see Set up a GX environment.
-
Run the following terminal command to install the GX 1.0 library:
Terminal input
pip install great_expectations
-
Verify GX 1.0 installed successfully:
Terminal input
great_expectations --version
The following output appears when GX 1.0 is successfully installed:
Terminal output
great_expectations, version 1.0.0a4
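The same check can also be made from inside Python. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and is not part of GX.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Look up the package installed by the pip command above.
gx_version = installed_version("great_expectations")
if gx_version is None:
    print("great_expectations is not installed in this environment")
else:
    print(f"great_expectations {gx_version} is installed")
```

This reads the package metadata that pip records at install time, so it reports on the environment the interpreter is actually running in.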
Test features and functionality
Procedure
-
Import the great_expectations library and expectations module.
The great_expectations module is the root of the GX library and contains shortcuts and convenience methods for starting a GX project in a Python session.
The expectations module contains all the Expectation classes that are provided by the GX library.
Run the following code in a Python interpreter, IDE, or script:
Python input
import great_expectations as gx
import great_expectations.expectations as gxe
-
Create a temporary Data Context and connect to sample data.
In Python, a Data Context provides the API for interacting with many common GX objects.
Run the following code to initialize a Data Context and then use it to read the contents of a .csv file into a Batch of sample data:
Python input
context = gx.get_context()
batch = context.data_sources.pandas_default.read_csv(
    "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)
You'll use this sample data to test your Expectations.
-
Create an Expectation.
Expectations are a fundamental component of GX. They allow you to explicitly define the state to which your data should conform.
The sample data you're using is taxi trip record data. With this data, you can make certain assumptions. For example, the passenger count shouldn't be zero because at least one passenger needs to be present. Additionally, a taxi can accommodate a maximum of six passengers.
Run the following code to define an Expectation that the contents of the column passenger_count consist of values ranging from 1 to 6:
Python input
expectation = gxe.ExpectColumnValuesToBeBetween(
    column="passenger_count", min_value=1, max_value=6
)
-
Run the following code to validate the sample data against your Expectation and view the results:
Python input
validation_result = batch.validate(expectation)
print(validation_result.describe())
The sample data conforms to the defined Expectation and the following Validation Results are returned:
Python output
{
"type": "expect_column_values_to_be_between",
"success": true,
"kwargs": {
"batch_id": "default_pandas_datasource-#ephemeral_pandas_asset",
"column": "passenger_count",
"min_value": 1.0,
"max_value": 6.0
},
"result": {
"element_count": 10000,
"unexpected_count": 0,
"unexpected_percent": 0.0,
"partial_unexpected_list": [],
"missing_count": 0,
"missing_percent": 0.0,
"unexpected_percent_total": 0.0,
"unexpected_percent_nonmissing": 0.0,
"partial_unexpected_counts": [],
"partial_unexpected_index_list": []
}
}
-
Optional. Create an Expectation that will fail when validated against the provided data.
A failed Expectation lets you know that something is wrong with the data, such as missing or incorrect values, or that there is a misunderstanding about the data.
Run the following code to create an Expectation that fails because it assumes that a taxi can seat a maximum of three passengers:
Python input
failed_expectation = gxe.ExpectColumnValuesToBeBetween(
    column="passenger_count", min_value=1, max_value=3
)
failed_validation_result = batch.validate(failed_expectation)
print(failed_validation_result.describe())
When an Expectation fails, the Validation Results of the failed Expectation include metrics to help you assess the severity of the issue:
Python output
{
"type": "expect_column_values_to_be_between",
"success": false,
"kwargs": {
"batch_id": "default_pandas_datasource-#ephemeral_pandas_asset",
"column": "passenger_count",
"min_value": 1.0,
"max_value": 3.0
},
"result": {
"element_count": 10000,
"unexpected_count": 853,
"unexpected_percent": 8.53,
"partial_unexpected_list": [
4,
4,
4,
4,
4,
4,
4,
4,
4,
4,
4,
4,
4,
4,
4,
4,
4,
4,
4,
4
],
"missing_count": 0,
"missing_percent": 0.0,
"unexpected_percent_total": 8.53,
"unexpected_percent_nonmissing": 8.53,
"partial_unexpected_counts": [
{
"value": 4,
"count": 20
}
],
"partial_unexpected_index_list": [
9147,
9148,
9149,
9150,
9151,
9152,
9153,
9154,
9155,
9156,
9157,
9158,
9159,
9160,
9161,
9162,
9163,
9164,
9165,
9166
]
}
}
To reduce the size of the report and make it easier to review, only a portion of the failed values and record indexes are included in the Validation Results. The counts and percentages still reflect every failed record in the validated data.
-
Optional. Go to the Expectations Gallery and experiment with other Expectations.
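To see how the metrics in the failed Validation Results relate to one another, the following plain-Python sketch recomputes them for a range check. It is an illustration only and is not how GX implements the Expectation. The function name `summarize_between_check` is hypothetical, the limit of 20 partial values is an assumption inferred from the output shown earlier, and the input list is a synthetic stand-in for the sample file. Missing values are not handled.

```python
from collections import Counter
from typing import Any, Dict, List

# Assumption: inferred from the 20 sample values in the output above.
PARTIAL_LIMIT = 20


def summarize_between_check(
    values: List[float], min_value: float, max_value: float
) -> Dict[str, Any]:
    """Recompute result-style metrics for an inclusive range check."""
    # Keep each out-of-range value together with its record index.
    unexpected = [
        (index, value)
        for index, value in enumerate(values)
        if not (min_value <= value <= max_value)
    ]
    # Only the first few failures are listed; the counts cover all of them.
    partial = unexpected[:PARTIAL_LIMIT]
    element_count = len(values)
    unexpected_count = len(unexpected)
    percent = 100.0 * unexpected_count / element_count if element_count else 0.0
    return {
        "success": unexpected_count == 0,
        "element_count": element_count,
        "unexpected_count": unexpected_count,
        "unexpected_percent": percent,
        "partial_unexpected_list": [value for _, value in partial],
        "partial_unexpected_index_list": [index for index, _ in partial],
        "partial_unexpected_counts": [
            {"value": value, "count": count}
            for value, count in Counter(v for _, v in partial).most_common()
        ],
    }


# Synthetic stand-in data: 9147 in-range values followed by 853 out-of-range ones.
sample = [1] * 9147 + [4] * 853
result = summarize_between_check(sample, min_value=1, max_value=3)
print(result["unexpected_count"], result["unexpected_percent"])  # 853 8.53
print(len(result["partial_unexpected_list"]))  # 20
```

The sketch shows why the partial lists hold only 20 entries while `unexpected_count` and `unexpected_percent` describe all 853 failing records.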
Sample code
# Import required modules from the GX library
import great_expectations as gx
import great_expectations.expectations as gxe

# Create a temporary Data Context and connect to provided sample data.
context = gx.get_context()
batch = context.data_sources.pandas_default.read_csv(
    "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)

# Create an Expectation
expectation = gxe.ExpectColumnValuesToBeBetween(
    column="passenger_count", min_value=1, max_value=6
)

# Validate the sample data against your Expectation and view the results
validation_result = batch.validate(expectation)
print(validation_result.describe())
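If you want a shorter report than the full printed output, one option is to parse it and keep only a few fields. The sketch below assumes that `describe()` returns a JSON string, as the printed output on this page suggests. The helper name `one_line_summary` is hypothetical, and the input is an abridged copy of the failed result shown earlier so that the example runs without GX installed.

```python
import json

# Abridged copy of the failed Validation Result printed earlier on this page.
described = """
{
  "type": "expect_column_values_to_be_between",
  "success": false,
  "kwargs": {"column": "passenger_count", "min_value": 1.0, "max_value": 3.0},
  "result": {"element_count": 10000, "unexpected_count": 853, "unexpected_percent": 8.53}
}
"""


def one_line_summary(described_json: str) -> str:
    """Condense a described Validation Result into a single line."""
    data = json.loads(described_json)
    status = "passed" if data["success"] else "failed"
    result = data.get("result", {})
    return (
        f'{data["type"]} on {data["kwargs"].get("column")}: {status} '
        f'({result.get("unexpected_count")} of {result.get("element_count")} values unexpected)'
    )


print(one_line_summary(described))
```

In a session that still has the `validation_result` object, passing `validation_result.describe()` to the helper in place of the literal string would be the equivalent call, under the same assumption about its return type.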
Next steps
-
Check out GX Cloud, our SaaS platform—it's now in public preview! Sign up here and you could be validating your data in minutes. We also offer regular GX Cloud workshops: click here to get more information and register.
-
To learn more about GX 1.0, see Community resources.
-
If you're ready to start using GX 1.0 with your own data, the Set up a GX environment documentation provides a more comprehensive guide to setting up GX to work with specific data formats and environments.