Use S3 Select and Python to Retrieve Data from a CSV Stored in a Bucket

An S3 bucket is useful for more than just storing your data: with S3 Select you can query and analyze CSV data stored in your bucket in place, without retrieving the whole object first.

This example uses a dataset of randomly-generated "sales data".

Here's an example excerpt of the CSV records (there are 40,000 total records in this dataset).

Text Only
date,item,customer
2025-12-06,ITEM_087,CUST_2237
2025-12-25,ITEM_012,CUST_3088
2025-12-22,ITEM_043,CUST_2320
2025-12-06,ITEM_002,CUST_1085
2025-12-05,ITEM_097,CUST_3040
2025-12-17,ITEM_066,CUST_2429
2025-12-05,ITEM_052,CUST_2042
2025-12-25,ITEM_084,CUST_1897
2025-12-11,ITEM_025,CUST_1048

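If you want to reproduce this example locally, a script along the following lines can generate a similar random dataset. The column value ranges are assumptions inferred from the excerpt above, not the original generator:

```python
import csv
import random
from datetime import date, timedelta

def generate_sales_data(path, rows=40_000, seed=42):
    """Write a random sales CSV matching the date,item,customer schema."""
    rng = random.Random(seed)
    start = date(2025, 12, 1)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "item", "customer"])
        for _ in range(rows):
            # Dates in December 2025, item/customer IDs in the ranges
            # seen in the excerpt (assumed, for illustration only)
            day = start + timedelta(days=rng.randrange(30))
            writer.writerow([
                day.isoformat(),
                f"ITEM_{rng.randint(1, 100):03d}",
                f"CUST_{rng.randint(1000, 3999)}",
            ])

generate_sales_data("sales_data.csv")
```

You can then upload the file to your bucket, e.g. with `aws s3 cp sales_data.csv s3://select-example/`.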
Here's an example Python script.

Python
#!/usr/bin/env python3
import argparse
import io

import boto3

# CSV Schema
# date,item,customer

# --- Boto3 Client Setup ---
# Assumes AWS credentials are available through the standard boto3 chain
# (environment variables, the profile named in AWS_PROFILE, or an IAM role)
s3_client = boto3.client('s3')

def run_s3_select_query(QUERY, BUCKET, OBJECT):
    """
    Executes the S3 Select query and streams the results.
    """
    try:
        # 1. Call select_object_content
        response = s3_client.select_object_content(
            Bucket=BUCKET,
            Key=OBJECT,
            ExpressionType='SQL',
            Expression=QUERY,
            InputSerialization={
                'CSV': {
                    'FileHeaderInfo': 'Use',
                    'FieldDelimiter': ',',
                }
            },
            OutputSerialization={'CSV': {}}
        )
    except Exception as e:
        print(f"An error occurred during S3 Select API call: {e}")
        return

    records_stream = response['Payload']
    csv_records = io.StringIO()

    for event in records_stream:
        if 'Records' in event:
            data = event['Records']['Payload'].decode('utf-8')
            csv_records.write(data)

        # The 'Stats' event provides metadata about the query execution,
        # such as bytes scanned and bytes returned (not printed here)
        elif 'Stats' in event:
            stats = event['Stats']['Details']

        # The 'End' event signals the end of the data stream
        elif 'End' in event:
            break

    final_output = csv_records.getvalue()

    if final_output:
        print(final_output.strip())
    else:
        print("\nNo records found matching the select criteria.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--csv", dest="object_name", type=str, default='sales_data.csv', help="Your CSV object key")
    parser.add_argument("-b", "--bucket", dest="bucket_name", type=str, default='select-example', help="Your bucket name")
    parser.add_argument("-i", "--customer-id", dest="customer_id", type=str, required=True, help="Customer ID")
    args = parser.parse_args()
    # In S3 Select SQL, string literals use single quotes; double quotes
    # delimit identifiers, so the reserved word "date" is quoted as a column name.
    # With FileHeaderInfo 'Use', columns are referenced by header name.
    sql_query = f"SELECT s.\"date\", s.item FROM s3object s WHERE s.customer = '{args.customer_id}'"
    run_s3_select_query(QUERY=sql_query, BUCKET=args.bucket_name, OBJECT=args.object_name)

Example script output.

Text Only
$ ./s3_select.py -i CUST_2237 
2025-12-06,ITEM_087
2025-12-25,ITEM_059
2025-12-23,ITEM_048
2025-12-22,ITEM_041
2025-12-06,ITEM_026
2025-12-10,ITEM_004
2025-12-11,ITEM_024
2025-12-25,ITEM_078
2025-12-15,ITEM_076
2025-12-27,ITEM_019
2025-12-04,ITEM_057
2025-12-16,ITEM_086
2025-12-15,ITEM_034
2025-12-20,ITEM_043
2025-12-22,ITEM_059
2025-12-04,ITEM_055
2025-12-26,ITEM_058
2025-12-12,ITEM_058
2025-12-05,ITEM_022
2025-12-24,ITEM_049
2025-12-19,ITEM_033
2025-12-04,ITEM_003
2025-12-03,ITEM_082
2025-12-23,ITEM_098
2025-12-12,ITEM_028
2025-12-30,ITEM_099
2025-12-02,ITEM_058
2025-12-24,ITEM_086
2025-12-05,ITEM_011
2025-12-11,ITEM_012
2025-12-08,ITEM_051
2025-12-27,ITEM_010
2025-12-02,ITEM_063
2025-12-10,ITEM_052
2025-12-17,ITEM_046
2025-12-06,ITEM_097
2025-12-10,ITEM_100
2025-12-22,ITEM_039
2025-12-22,ITEM_060
2025-12-19,ITEM_090
2025-12-27,ITEM_056
2025-12-17,ITEM_013
2025-12-13,ITEM_018
2025-12-28,ITEM_072
2025-12-13,ITEM_062
2025-12-20,ITEM_097
2025-12-22,ITEM_057
2025-12-13,ITEM_019
2025-12-21,ITEM_014
2025-12-07,ITEM_080
2025-12-13,ITEM_098
2025-12-14,ITEM_008
2025-12-12,ITEM_003
2025-12-29,ITEM_052
2025-12-17,ITEM_061
2025-12-18,ITEM_030
2025-12-30,ITEM_091
2025-12-08,ITEM_009
2025-12-05,ITEM_088
2025-12-11,ITEM_036
2025-12-27,ITEM_070
2025-12-24,ITEM_094
2025-12-11,ITEM_025
2025-12-29,ITEM_070
2025-12-27,ITEM_064
2025-12-30,ITEM_065
2025-12-18,ITEM_070

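The quoting rules above are worth keeping straight when you vary the query: single quotes delimit string literals (a literal quote is escaped by doubling it), while double quotes delimit identifiers. A small helper along these lines, a sketch rather than part of the script above, builds both a filter expression and an aggregate expression (S3 Select supports the aggregates COUNT, SUM, AVG, MIN, and MAX):

```python
def build_customer_query(customer_id: str, count_only: bool = False) -> str:
    """Build an S3 Select expression filtering on the customer column.

    Single quotes in the value are doubled (the SQL-standard escape),
    so an ID can't break out of the string literal.
    """
    safe_id = customer_id.replace("'", "''")
    if count_only:
        # Aggregate form: count matching rows instead of returning them
        return f"SELECT COUNT(*) FROM s3object s WHERE s.customer = '{safe_id}'"
    # "date" is double-quoted because it is a reserved word
    return f'SELECT s."date", s.item FROM s3object s WHERE s.customer = \'{safe_id}\''

print(build_customer_query("CUST_2237"))
print(build_customer_query("CUST_2237", count_only=True))
```

Passing the count query to the same `run_s3_select_query` function returns a single number, which is often all you need from a 40,000-row object.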
While this approach does not turn an S3 object into a database, it is particularly useful for large or historical datasets that are queried infrequently or are prohibitively large to store in a traditional database. The query runs server-side, so you retrieve only the relevant data without consuming local storage or processing power.