This page documents the schema of the YAML configuration file used by the Migrator and the Validator.
The configuration file is a YAML object whose fields are enumerated below, each preceded by a comment describing its role. All fields are mandatory unless their documentation starts with “Optional”. All values of the form <xxx> are placeholders that should be replaced with your specific settings.
The YAML format is whitespace-sensitive: make sure to use the proper number of spaces so that all the properties belonging to the same object stay at the same indentation level. If the configuration file is not correctly formatted, the Migrator will fail at startup with a message like “DecodingFailure at …” or “ParsingFailure …” describing the problem. For instance, the following line in the logs means that the mandatory property target is missing from the configuration file:
Exception in thread "main" DecodingFailure at .target: Missing required field
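For illustration, a configuration file like the following (abridged, with assumed values) defines source and savepoints but omits the top-level target property, and would therefore fail at startup with the error shown above:
source:
  type: cassandra
  # ...
savepoints:
  path: /app/savepoints
  intervalSeconds: 300
# The top-level 'target' property is missing, which triggers:
# DecodingFailure at .target: Missing required field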
The configuration file requires the following top-level properties (i.e., with no leading space before the property names), which are documented further below:
# Source configuration
source:
  # ...
# Target configuration
target:
  # ...
# Optional - Columns to rename
renames:
  # ...
# Savepoints configuration
savepoints:
  # ...
# Validator configuration. Required only if the app is executed in validation mode.
validation:
  # ...
# Optional - Used internally
skipTokenRanges: []
# Optional - Used internally
skipSegments: []
These top-level properties are documented in the following sections (except skipTokenRanges and skipSegments, which are used internally).
The source property describes the type of data to read from. It must be an object with a field type defining the type of source, and other fields depending on the type of source. Valid values for the source type are:
cassandra for a CQL-compatible source (Apache Cassandra or ScyllaDB).
parquet for a dataset stored using the Parquet format.
dynamodb for a DynamoDB-compatible source (AWS DynamoDB or ScyllaDB Alternator).
dynamodb-s3-export for a DynamoDB table exported to S3.
The following subsections detail the schema of each source type.
A source of type cassandra can only be used together with a target of type cassandra.
source:
  type: cassandra
  # Host name of one of the nodes of your database cluster
  host: <cassandra-server-01>
  # TCP port to use for CQL
  port: 9042
  # Optional - Connection credentials
  credentials:
    username: <username>
    password: <pass>
  # Optional - Datacenter to use
  localDC: <datacenter>
  # Keyspace in which the table is located
  keyspace: <keyspace>
  # Name of the table to read
  table: <table>
  # Consistency Level for the source connection.
  # Options are: LOCAL_ONE, ONE, LOCAL_QUORUM, QUORUM.
  # We recommend using LOCAL_QUORUM. If using ONE or LOCAL_ONE, ensure the source system is fully repaired.
  consistencyLevel: LOCAL_QUORUM
  # Preserve the TTLs and WRITETIMEs of cells in the source database. Note that this
  # option is *incompatible* with tables containing collections (lists, maps, sets).
  preserveTimestamps: true
  # Number of splits to use - this should be at minimum the number of cores
  # available in the Spark cluster, and optimally more; a higher split count leads
  # to more fine-grained resumes. Aim for 8 * (Spark cores).
  splitCount: 256
  # Number of connections to use to Apache Cassandra when copying
  connections: 8
  # Number of rows to fetch in each read
  fetchSize: 1000
  # Optional - SSL options as per https://github.com/scylladb/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-ssl-connection-options
  sslOptions:
    clientAuthEnabled: false
    enabled: false
    # All the properties below are optional (generally only trustStorePassword and trustStorePath are needed)
    trustStorePassword: <pass>
    trustStorePath: <path>
    trustStoreType: JKS
    keyStorePassword: <pass>
    keyStorePath: <path>
    keyStoreType: JKS
    enabledAlgorithms:
      - TLS_RSA_WITH_AES_128_CBC_SHA
      - TLS_RSA_WITH_AES_256_CBC_SHA
    protocol: TLS
  # Optional - Condition to filter the data that will be migrated
  where: race_start_date = '2015-05-27' AND race_end_date = '2015-05-27'
A source of type parquet can only be used together with a target of type cassandra.
source:
  type: parquet
  # Path of the Parquet file.
  # It can be a file located on the Spark master node filesystem (e.g. '/some-directory/some-file.parquet'),
  # or a file stored on S3 (e.g. 's3a://some-bucket/some-file.parquet')
  path: <path>
  # Optional - In case of a file stored on S3, the AWS credentials to use
  credentials:
    # ... see the “AWS Authentication” section below
A source of type dynamodb can only be used together with a target of type dynamodb.
source:
  type: dynamodb
  # Name of the table to read
  table: <table>
  # Optional - Connect to a custom endpoint. Mandatory if reading from ScyllaDB Alternator.
  endpoint:
    # If reading from ScyllaDB Alternator, prefix the hostname with 'http://'.
    host: <host>
    port: <port>
  # Optional - AWS availability region.
  region: <region>
  # Optional - Authentication credentials. See the section “AWS Authentication” for more details.
  credentials:
    accessKey: <access-key>
    secretKey: <secret-key>
  # Optional - Split factor for reading. The default is to split the source data into chunks
  # of 128 MB that can be processed in parallel by the Spark executors.
  scanSegments: 1
  # Optional - Throttling settings, set based on your database read capacity units (or desired capacity)
  readThroughput: 1
  # Optional - Can be between 0.1 and 1.5, inclusively.
  # 0.5 represents the default read rate, meaning that the job will attempt to consume half of the read capacity of the table.
  # If you increase the value above 0.5, Spark will increase the request rate; decreasing the value below 0.5 decreases the read request rate.
  # (The actual read rate will vary, depending on factors such as whether there is a uniform key distribution in the DynamoDB table.)
  throughputReadPercent: 1.0
  # Optional - Maximum number of tasks per Spark executor. The default is the value of 'scanSegments'.
  maxMapTasks: 1
The properties scanSegments and maxMapTasks can have a significant impact on the migration throughput. By default, the migrator splits the data into segments of 128 MB each. Use maxMapTasks to cap the parallelism level used by the Spark executors when processing the segments.
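As a rough, illustrative sizing sketch (the figures below are assumptions, not recommendations): with the default segment size of 128 MB, a table of about 64 GB is split into roughly 64 × 1024 / 128 = 512 segments. You could make that explicit with scanSegments and bound the per-executor parallelism with maxMapTasks:
source:
  type: dynamodb
  table: <table>
  # Illustrative: ~64 GB of data at 128 MB per segment ≈ 512 segments
  scanSegments: 512
  # Illustrative: process at most 8 segments concurrently per Spark executor
  maxMapTasks: 8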
A source of type dynamodb-s3-export can only be used together with a target of type dynamodb.
source:
  type: dynamodb-s3-export
  # Name of the S3 bucket where the DynamoDB table has been exported
  bucket: <bucket-name>
  # Key of the `manifest-summary.json` object in the bucket
  manifestKey: <manifest-summary-key>
  # Optional - Connect to a custom endpoint instead of the standard AWS S3 endpoint
  endpoint:
    # Specify the hostname without a protocol
    host: <host>
    port: <port>
  # Optional - AWS availability region
  region: <region>
  # Optional - Connection credentials. See the section “AWS Authentication” below for more details.
  credentials:
    accessKey: <access-key>
    secretKey: <secret-key>
  # Key schema and attribute definitions, see https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_TableCreationParameters.html
  tableDescription:
    # See https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_AttributeDefinition.html
    attributeDefinitions:
      - name: <attribute-name>
        type: <attribute-type>
      # ... other attributes
    # See https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_KeySchemaElement.html
    keySchema:
      - name: <key-name>
        type: <key-type>
      # ... other key schema definitions
  # Optional - Whether to use “path-style access” in S3 (see https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html). Default is false.
  usePathStyleAccess: true
The target property describes the type of data to write. It must be an object with a field type defining the type of target, and other fields depending on the type of target. Valid values for the target type are:
cassandra for a CQL-compatible target (Apache Cassandra or ScyllaDB).
dynamodb for a DynamoDB-compatible target (DynamoDB or ScyllaDB Alternator).
The following subsections detail the schema of each target type.
target:
  type: cassandra
  # Host name of one of the nodes of your target database cluster
  host: <scylla-server-01>
  # TCP port for CQL
  port: 9042
  # Keyspace to use
  keyspace: <keyspace>
  # Optional - Datacenter to use
  localDC: <datacenter>
  # Optional - Authentication credentials
  credentials:
    username: <username>
    password: <pass>
  # Name of the table to write. If it does not exist, it will be created on the fly.
  # It has to have the same schema as the source table. If needed, you can rename
  # columns along the way; see the documentation page “Rename Columns”.
  table: <table>
  # Consistency Level for the target connection.
  # Options are: LOCAL_ONE, ONE, LOCAL_QUORUM, QUORUM.
  consistencyLevel: LOCAL_QUORUM
  # Number of connections to use to ScyllaDB / Apache Cassandra when copying
  connections: 16
  # Spark pads decimals with zeros appropriate to their scale. This causes values
  # like '3.5' to be copied as '3.5000000000...' to the target. There is currently no
  # good way to preserve the original value, so this flag can strip trailing zeros
  # on decimal values before they are written.
  stripTrailingZerosForDecimals: false
  # Optional - If timestamps are not preserved (i.e. preserveTimestamps is false in the source),
  # the writer can enforce a single TTL and/or WRITETIME for ALL written records.
  # For example, setting the WRITETIME to a point in time BEFORE starting dual writes
  # makes your migration safe from overwriting dual-written data, even for collections.
  # ALL written rows get the same TTL, the same WRITETIME, or both
  # (you can set just one of them, both, or none).
  # Optional - TTL in seconds (e.g. 7776000 is 90 days)
  writeTTLInS: 7776000
  # Optional - WRITETIME in microseconds (e.g. 1640998861000000 is Saturday, January 1, 2022 2:01:01 AM GMT+01:00)
  writeWritetimestampInuS: 1640998861000000
  # Optional - SSL options as per https://github.com/scylladb/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-ssl-connection-options
  sslOptions:
    clientAuthEnabled: false
    enabled: false
    # All the properties below are optional (generally only trustStorePassword and trustStorePath are needed)
    trustStorePassword: <pass>
    trustStorePath: <path>
    trustStoreType: JKS
    keyStorePassword: <pass>
    keyStorePath: <path>
    keyStoreType: JKS
    enabledAlgorithms:
      - TLS_RSA_WITH_AES_128_CBC_SHA
      - TLS_RSA_WITH_AES_256_CBC_SHA
    protocol: TLS
target:
  type: dynamodb
  # Name of the table to write. If it does not exist, it will be created on the fly.
  table: <table>
  # Optional - Throttling settings, set based on your database write capacity units (or desired capacity).
  # By default, for provisioned tables we use the configured write capacity units, and for on-demand tables we use the value 40000.
  writeThroughput: 1
  # Optional - Can be between 0.1 and 1.5, inclusively.
  # 0.5 represents the default write rate, meaning that the job will attempt to consume half of the write capacity of the table.
  # If you increase the value above 0.5, Spark will increase the request rate; decreasing the value below 0.5 decreases the write request rate.
  # (The actual write rate will vary, depending on factors such as whether there is a uniform key distribution in the DynamoDB table.)
  throughputWritePercent: 1.0
  # When transferring DynamoDB sources to DynamoDB targets (such as other DynamoDB tables or Alternator tables),
  # the migrator supports transferring live changes occurring on the source table after transferring an initial
  # snapshot.
  # Please see the documentation page “Stream Changes” for more details about this option.
  streamChanges: false
  # Optional - When streamChanges is true, skip the initial snapshot transfer and only stream changes.
  # This setting is ignored if streamChanges is false.
  skipInitialSnapshotTransfer: false
The optional renames property lists the item columns to rename during the migration.
renames:
  - from: <source-column-name>
    to: <target-column-name>
  # ... other columns to rename
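For illustration, assuming a source table with columns user_id and first_name (hypothetical names) that should be written as id and firstname in the target table, the renames section would look like this:
renames:
  - from: user_id
    to: id
  - from: first_name
    to: firstname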
When migrating data from Apache Cassandra or DynamoDB, the migrator is able to resume an interrupted migration. To achieve this, it stores so-called “savepoints” along the process to remember which data items have already been migrated and should be skipped when the migration is restarted.
savepoints:
  # Where should savepoint configurations be stored? This is a path on the host running
  # the Spark driver - usually the Spark master.
  path: /app/savepoints
  # Interval at which savepoints will be created
  intervalSeconds: 300
The validation field and its properties are mandatory only when the application is executed in validation mode.
validation:
  # Should WRITETIMEs and TTLs be compared?
  compareTimestamps: true
  # What difference should we allow between TTLs?
  ttlToleranceMillis: 60000
  # What difference should we allow between WRITETIMEs?
  writetimeToleranceMillis: 1000
  # How many differences to fetch and print
  failuresToFetch: 100
  # What difference should we allow between floating point numbers?
  floatingPointTolerance: 0.001
  # What difference (in milliseconds) should we allow between timestamps?
  timestampMsTolerance: 0
When reading from DynamoDB or S3, or when writing to DynamoDB, the communication with AWS can be configured with the properties credentials, endpoint, and region in the configuration:
credentials:
  accessKey: <access-key>
  secretKey: <secret-key>
# Optional - AWS endpoint configuration
endpoint:
  host: <host>
  port: <port>
# Optional - AWS availability region, required if you use a custom endpoint
region: <region>
Additionally, you can authenticate with AssumeRole. In that case, the accessKey and secretKey are the credentials of the user whose access to the resource (DynamoDB table or S3 bucket) has been granted via a “role”, and you need to add the property assumeRole as follows:
credentials:
  accessKey: <access-key>
  secretKey: <secret-key>
  assumeRole:
    arn: <role-arn>
    # Optional - Session name to use. If not set, we use 'scylla-migrator'.
    sessionName: <role-session-name>
# Note that the region is mandatory when you use `assumeRole`
region: <region>