Apache Iceberg Tables: Complete Guide with PySpark Examples (2026)

Last Updated: June 2026  ·  14 min read

Quick Answer

An Apache Iceberg table is an open table format that adds ACID transactions, time travel, schema evolution, and partition evolution on top of Parquet files in object storage. It lets data engineers safely update, delete, and version massive datasets without rewriting entire partitions. Works natively with PySpark, Trino, Flink, and AWS Athena.

If you've ever had a 3 AM page because someone ran a bad DELETE on a Hive table and wiped a production partition — you'll understand exactly why Apache Iceberg exists.

I've run Spark pipelines on EMR processing hundreds of millions of events daily. Before Iceberg, fixing a bad write meant restoring from S3 backups, replaying hours of Kafka messages, and explaining to the team why the dashboard was wrong. After migrating to Iceberg, a bad write is just: roll back to the last snapshot, done.

This guide covers everything you need to use Apache Iceberg in production — with working PySpark code you can run today.


What Is an Apache Iceberg Table?

Apache Iceberg is an open table format for analytic datasets at petabyte scale. It was created by Netflix engineers and is now a top-level Apache project.

Think of it as a layer that sits between your query engine (Spark, Trino, Athena) and your raw Parquet files on S3. This layer tracks:

  • Which files belong to the current table snapshot
  • The full history of every write operation
  • Column types, partition specs, and sort orders

The result is a table that behaves like a database — with ACID guarantees, safe concurrent writes, and the ability to query any point in history.

Iceberg vs Hive Table: The Core Difference

Feature Hive Table Iceberg Table
ACID transactions No (unless ORC + LLAP) Yes — every write
Time travel No Yes — snapshot-based
Schema evolution Limited, risky Safe — add/drop/rename columns
Partition evolution No — must rewrite Yes — no rewrite needed
Row-level updates No Yes — MERGE INTO
Concurrent writers Unsafe Safe
Engine support Hive, Spark Spark, Trino, Flink, Athena, Snowflake

Setting Up Iceberg with PySpark (Quickstart)

Before any code, you need the Iceberg Spark runtime. The version depends on your Spark + Scala combination.

# For Spark 3.5 + Scala 2.12
pip install pyspark==3.5.0

# Download the Iceberg runtime jar (or add to spark-submit --packages)
# org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2

Starting a SparkSession with Iceberg

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("IcebergDemo")
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

For AWS EMR / S3: Replace hadoop catalog with glue catalog and point warehouse to your S3 bucket. The rest of the code is identical.

Create Your First Iceberg Table

spark.sql("""
    CREATE TABLE local.db.events (
        event_id     BIGINT,
        user_id      BIGINT,
        event_type   STRING,
        payload      STRING,
        created_at   TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(created_at))
""")

Notice PARTITIONED BY (days(created_at)) — this is hidden partitioning. Iceberg automatically partitions by day from the timestamp. No manual dt=2026-06-23 partition column needed in your data.

Insert Data

from pyspark.sql import Row
from datetime import datetime

rows = [
    Row(event_id=1, user_id=101, event_type="click", payload='{"page":"home"}', created_at=datetime(2026, 6, 20, 10, 0, 0)),
    Row(event_id=2, user_id=102, event_type="purchase", payload='{"item":"pro"}', created_at=datetime(2026, 6, 21, 14, 0, 0)),
    Row(event_id=3, user_id=103, event_type="click", payload='{"page":"pricing"}', created_at=datetime(2026, 6, 22, 9, 0, 0)),
]

df = spark.createDataFrame(rows)
df.writeTo("local.db.events").append()

Time Travel — Query Any Point in History

Every write to an Iceberg table creates a new snapshot. You can query any snapshot by ID or by timestamp — no backups needed.

List All Snapshots

spark.sql("SELECT snapshot_id, committed_at, operation FROM local.db.events.snapshots").show()
+-------------------+--------------------+---------+
|        snapshot_id|        committed_at|operation|
+-------------------+--------------------+---------+
|5674839201837462910|2026-06-20 10:01:23|   append|
|8921038472019384756|2026-06-21 14:02:11|   append|
|3847201938472016473|2026-06-22 09:03:45|overwrite|
+-------------------+--------------------+---------+

Query by Snapshot ID

# Read exactly what the table looked like after the first write
df_snapshot = spark.read \
    .option("snapshot-id", "5674839201837462910") \
    .table("local.db.events")

df_snapshot.show()

Query by Timestamp

# What was in the table at 11pm on June 21?
df_history = spark.read \
    .option("as-of-timestamp", "1750550400000") \  # epoch ms
    .table("local.db.events")

df_history.show()

SQL Syntax for Time Travel

-- By timestamp
SELECT * FROM local.db.events
TIMESTAMP AS OF '2026-06-21 23:00:00';

-- By snapshot ID
SELECT * FROM local.db.events
VERSION AS OF 5674839201837462910;

Real use case: A bad pipeline ran at 2am and overwrote good data. With time travel, you SELECT from the snapshot just before the bad write, write the result back to the table, and the problem is fixed in minutes — not hours of Kafka replay.


Schema Evolution — Change Columns Without Rewriting Data

Iceberg tracks columns by internal ID, not by name. This means you can safely rename, add, or drop columns and Iceberg will still read old Parquet files correctly.

Add a Column

spark.sql("ALTER TABLE local.db.events ADD COLUMN region STRING AFTER user_id")

Done. Old files don't have region — Iceberg returns null for those rows. New writes include the column. No data migration.

Rename a Column

spark.sql("ALTER TABLE local.db.events RENAME COLUMN payload TO event_payload")

Old Parquet files had the column stored with an internal ID. Iceberg maps the old ID to the new name. All old data reads correctly under the new name.

Drop a Column

spark.sql("ALTER TABLE local.db.events DROP COLUMN region")

The column is removed from new reads. Old Parquet files still have the bytes on disk, but Iceberg ignores them. No data rewrite.

Change Column Type (Safe Promotions Only)

# INT → BIGINT is safe (widening)
spark.sql("ALTER TABLE local.db.events ALTER COLUMN event_id TYPE BIGINT")

# BIGINT → INT is NOT safe — Iceberg will reject it

Safe type promotions: int → long, float → double, decimal(p,s) → decimal(p2,s) where p2 > p.


Partition Evolution — Change Partitioning Without Migration

This is one of Iceberg's most powerful features. Traditional Hive tables require full data rewrites to change partitioning. Iceberg does not.

Original Partition: By Day

# Table was created with daily partitioning
spark.sql("""
    CREATE TABLE local.db.telemetry_events (
        metric_id   BIGINT,
        host        STRING,
        value       DOUBLE,
        ts          TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")

After 6 months, query patterns shift — most queries filter by host AND day. You want to add host to the partition spec.

Add a Partition Field (No Data Rewrite)

spark.sql("ALTER TABLE local.db.telemetry_events ADD PARTITION FIELD host")

From this moment: - New data is written into ts_day=2026-06-23/host=prod-server-01/ style partitions - Old data stays in ts_day=2026-01-01/ partitions unchanged - Queries spanning old and new data work transparently — Iceberg merges both partition specs at query time

# This query works across both old (day-only) and new (day+host) partitions
spark.sql("""
    SELECT date_trunc('day', ts) as day, host, avg(value) as avg_val
    FROM local.db.telemetry_events
    WHERE ts >= '2026-01-01'
    GROUP BY 1, 2
""").show()

Row-Level Updates and Deletes (MERGE INTO)

Raw Parquet files are immutable — you can't update a row in place. Iceberg solves this with copy-on-write or merge-on-read strategies, exposed through standard SQL.

UPDATE a Single Column

spark.sql("""
    UPDATE local.db.events
    SET event_type = 'purchase_confirmed'
    WHERE event_type = 'purchase' AND user_id = 102
""")

DELETE Rows

spark.sql("""
    DELETE FROM local.db.events
    WHERE event_id = 3
""")

MERGE INTO (Upsert) — The Most Useful Operation

This is the pattern I use daily in production pipelines. You have a batch of incoming events — some are new rows, some are updates to existing rows.

# Incoming batch — could be new events or corrections to existing ones
incoming = [
    Row(event_id=2, user_id=102, event_type="refund", payload='{"item":"pro","reason":"cancel"}', created_at=datetime(2026, 6, 21, 14, 0, 0)),
    Row(event_id=4, user_id=104, event_type="signup", payload='{"plan":"free"}', created_at=datetime(2026, 6, 23, 8, 0, 0)),
]
df_incoming = spark.createDataFrame(incoming)
df_incoming.createOrReplaceTempView("incoming_events")

spark.sql("""
    MERGE INTO local.db.events AS target
    USING incoming_events AS source
    ON target.event_id = source.event_id

    WHEN MATCHED THEN
        UPDATE SET
            target.event_type  = source.event_type,
            target.payload     = source.payload

    WHEN NOT MATCHED THEN
        INSERT (event_id, user_id, event_type, payload, created_at)
        VALUES (source.event_id, source.user_id, source.event_type, source.payload, source.created_at)
""")

After MERGE: event_id=2 is updated to refund, event_id=4 is inserted as a new row.

Production tip: Always join on the primary key AND a time range (AND target.created_at >= '2026-06-01') to limit the files Iceberg must scan. MERGE without a time filter on a petabyte table will scan everything.


Snapshot Management — Keep S3 Clean

Every write creates a new snapshot and new Parquet files. Old files accumulate. You must expire old snapshots on a schedule.

Expire Snapshots Older Than 7 Days

from datetime import datetime, timedelta

cutoff = datetime.utcnow() - timedelta(days=7)
cutoff_ts = int(cutoff.timestamp() * 1000)  # Iceberg needs epoch ms

spark.sql(f"""
    CALL local.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '{cutoff.strftime('%Y-%m-%d %H:%M:%S.000')}'
    )
""")

Remove Orphan Files

Files that were written but never committed (crashed jobs) linger on S3. Clean them up:

spark.sql("""
    CALL local.system.remove_orphan_files(
        table => 'db.events',
        older_than => TIMESTAMP '2026-06-16 00:00:00.000'
    )
""")

Rewrite Small Files (Compaction)

Streaming pipelines produce many tiny Parquet files. Compact them for faster queries:

spark.sql("""
    CALL local.system.rewrite_data_files(
        table => 'db.events',
        strategy => 'sort',
        sort_order => 'event_id ASC'
    )
""")

Run compaction weekly on high-write tables. On Telemetrix's pipeline, compaction cut Athena query costs by ~40% by reducing file count from thousands to dozens per partition.


Apache Iceberg with AWS Athena

If you're already on AWS, Athena v3 supports Iceberg out of the box — no Spark cluster needed for most operations.

Create an Iceberg Table in Athena

CREATE TABLE events_iceberg (
    event_id    BIGINT,
    user_id     BIGINT,
    event_type  VARCHAR,
    created_at  TIMESTAMP
)
LOCATION 's3://your-bucket/iceberg/events/'
TBLPROPERTIES (
    'table_type'        = 'ICEBERG',
    'format'            = 'parquet',
    'write_compression' = 'snappy'
);

MERGE INTO in Athena

MERGE INTO events_iceberg AS target
USING incoming_batch AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN
    UPDATE SET event_type = source.event_type
WHEN NOT MATCHED THEN
    INSERT VALUES (source.event_id, source.user_id, source.event_type, source.created_at);

Time Travel in Athena

SELECT * FROM events_iceberg
FOR TIMESTAMP AS OF TIMESTAMP '2026-06-20 10:00:00 UTC';

Common Mistakes Data Engineers Make with Iceberg

1. Skipping snapshot expiry — Without expire_snapshots, old Parquet files accumulate on S3 indefinitely. Set up a weekly Lambda or Spark job to expire snapshots older than your SLA window.

2. MERGE without a partition filter — A MERGE that touches the full table rewrites every affected file. Always add a time range predicate to limit the scan scope.

3. Using wrong catalog config — Iceberg needs a catalog (Hadoop, Glue, Hive, REST). Mixing catalog configs between Spark and Athena causes table-not-found errors. Use Glue Data Catalog as the single source of truth on AWS.

4. Over-partitioning — Iceberg's hidden partitioning makes it tempting to add many partition fields. Each extra field multiplies the directory tree. Start with one time-based partition (days or months) and only add more if query plans show full partition scans.

5. Ignoring rewrite_data_files — Streaming ingestion creates thousands of tiny files. Without compaction, Athena and Trino query planning slows down as the file list grows. Schedule compaction.


solutiongigs.in Data Tools

Working with large-scale data pipelines and need to inspect or convert files quickly? The free tools at solutiongigs.in handle Parquet previews, JSON/CSV conversion, and SQL formatting without any local setup. Useful when you need a quick look at Iceberg metadata files or output Parquet without spinning up a Spark session.


Frequently Asked Questions

What is an Apache Iceberg table?

An Apache Iceberg table is an open table format designed for huge analytic datasets. It adds ACID transactions, time travel, schema evolution, and partition evolution on top of plain Parquet or ORC files stored in object storage like S3. Unlike Hive tables, Iceberg tracks every change in metadata files, making large-scale data lake operations safe and reliable.

What is the difference between Iceberg and Delta Lake?

Both Iceberg and Delta Lake solve the same core problem — ACID transactions on data lakes. The key differences: Iceberg is engine-agnostic and works natively with Spark, Trino, Flink, Athena, and Snowflake. Delta Lake originated from Databricks and has tighter Spark integration but historically had less support in non-Spark engines. Iceberg's partition evolution is more flexible — you can change partitioning without rewriting data, which Delta Lake cannot do.

Does Apache Iceberg work with AWS Athena?

Yes. AWS Athena v3 supports Apache Iceberg tables natively. You can create, query, and run MERGE/UPDATE/DELETE on Iceberg tables directly from Athena SQL — no Spark cluster needed for most operations. Iceberg tables on Athena store data in S3 and use the AWS Glue Data Catalog to track metadata.

How does Iceberg time travel work?

Iceberg time travel works through snapshots. Every write operation (INSERT, UPDATE, DELETE, MERGE) creates a new snapshot with a unique snapshot ID and timestamp. To query historical data, you specify either a snapshot ID or a timestamp in your query. Iceberg reads the metadata files for that snapshot and fetches exactly the Parquet files that were valid at that point in time — no data is overwritten.

Can you update individual rows in an Iceberg table?

Yes. Apache Iceberg supports row-level updates and deletes using copy-on-write or merge-on-read strategies. The MERGE INTO statement lets you upsert data — inserting new rows and updating existing ones in a single operation. This is one of Iceberg's biggest advantages over raw Parquet files, which are immutable and cannot be updated in place.

What is schema evolution in Iceberg?

Schema evolution in Iceberg means you can add, drop, rename, or reorder columns without rewriting existing data files. Iceberg tracks column IDs internally — not just column names — so a renamed column is still correctly read in old data files. This is safe to do in production because existing queries continue to work and existing Parquet files are not touched.

How do I expire old Iceberg snapshots?

Use the expire_snapshots procedure to remove old snapshots and reclaim S3 storage. In PySpark: CALL catalog.system.expire_snapshots('db.table', TIMESTAMP '2026-01-01 00:00:00.000'). This deletes snapshot metadata and orphaned data files older than the given timestamp. Always keep at least a few days of snapshots for recovery. Run this on a schedule — weekly is a common cadence in production.


Conclusion

Apache Iceberg tables solve the problems that have made data lake engineering painful for years — bad writes you can't roll back, partition changes that require full rewrites, schema changes that break downstream jobs, and row-level updates that require overwriting entire partitions.

The key things to take away:

  • Time travel lets you roll back any bad write in minutes — query any snapshot by ID or timestamp
  • Schema evolution is safe — add, rename, and drop columns without touching existing Parquet files
  • Partition evolution changes your partitioning strategy with no data migration
  • MERGE INTO gives you real upserts on object storage — pair it with a partition filter in production
  • Snapshot expiry + compaction keep S3 costs under control — schedule both

If you're running Spark on EMR or using Athena and you're still on raw Hive tables, Iceberg is the migration that pays for itself the first time a bad pipeline runs at 3am.

Need to inspect Parquet files or convert data formats without spinning up a cluster? Try the free tools at solutiongigs.in — no sign-up, no watermarks.


Mohammed Yaseen

Mohammed Yaseen

Founder, SolutionGigs

Mohammed has built and operated data pipelines processing hundreds of millions of events daily on AWS EMR and Spark — including Iceberg table migrations on production Kafka + Athena infrastructure. He founded SolutionGigs to bring production-grade engineering tools and tutorials to the developer community. LinkedIn →