This page explains how to fill the source and target properties of the configuration file to migrate data from Apache Cassandra, ScyllaDB, or a Parquet file, to Apache Cassandra or ScyllaDB.
In the file config.yaml, make sure to keep only one source property and one target property, and configure them as explained in the following subsections, according to your case.
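For instance, the overall shape of config.yaml is a single source section followed by a single target section. This is only a minimal sketch based on this page; any other top-level sections described elsewhere in this documentation (such as column renames) are omitted:
# config.yaml (sketch)
source:
  type: cassandra   # or parquet, see the subsections below
  # ... source-specific properties
target:
  type: cassandra
  # ... target-specific properties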
The data source can be an Apache Cassandra or ScyllaDB table, or a Parquet file. Whether reading from Apache Cassandra or from ScyllaDB, the source type should be cassandra in the configuration file. Here is a minimal source configuration:
source:
  type: cassandra
  # Host name of one of the nodes of your database cluster
  host: <cassandra-server-01>
  # TCP port to use for CQL
  port: 9042
  # Keyspace in which the table is located
  keyspace: <keyspace>
  # Name of the table to read
  table: <table>
  # Consistency level for the source connection.
  # Options are: LOCAL_ONE, ONE, LOCAL_QUORUM, QUORUM.
  # We recommend using LOCAL_QUORUM. If using ONE or LOCAL_ONE, ensure the source system is fully repaired.
  consistencyLevel: LOCAL_QUORUM
  # Preserve TTLs and WRITETIMEs of cells in the source database. Note that this
  # option is *incompatible* with copying tables that contain collections (lists, maps, sets).
  preserveTimestamps: true
  # Number of splits to use - this should be at least the number of cores
  # available in the Spark cluster, and optimally more; a higher split count leads
  # to more fine-grained resumes. Aim for 8 * (number of Spark cores).
  splitCount: 256
  # Number of connections to use to Apache Cassandra when copying
  connections: 8
  # Number of rows to fetch in each read
  fetchSize: 1000
Where the values <cassandra-server-01>, <keyspace>, and <table> should be replaced with your specific values.
Additionally, you can set the following optional properties:
source:
  # ... same as above
  # Datacenter to use
  localDC: <datacenter>
  # Connection credentials
  credentials:
    username: <username>
    password: <pass>
  # SSL options as per https://github.com/scylladb/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-ssl-connection-options
  sslOptions:
    clientAuthEnabled: false
    enabled: false
    # All of the following are optional (generally only trustStorePassword and trustStorePath are needed)
    trustStorePassword: <pass>
    trustStorePath: <path>
    trustStoreType: JKS
    keyStorePassword: <pass>
    keyStorePath: <path>
    keyStoreType: JKS
    enabledAlgorithms:
      - TLS_RSA_WITH_AES_128_CBC_SHA
      - TLS_RSA_WITH_AES_256_CBC_SHA
    protocol: TLS
  # Condition to filter the data that will be migrated
  where: race_start_date = '2015-05-27' AND race_end_date = '2015-05-27'
Where <datacenter>, <username>, <pass>, <path>, and the content of the where property should be replaced with your specific values.
The Migrator can read data from a Parquet file located on the filesystem of the Spark master node, or in an S3 bucket. In both cases, set the source type to parquet. Here is a complete source configuration to read from the filesystem:
source:
  type: parquet
  path: /<my-directory/my-file.parquet>
Where <my-directory/my-file.parquet> should be replaced with your actual file path.
Here is a minimal source configuration to read the Parquet file from an S3 bucket:
source:
  type: parquet
  path: s3a://<my-bucket/my-key.parquet>
Where <my-bucket/my-key.parquet> should be replaced with your actual S3 bucket and key.
If the object in the S3 bucket is not public, you can provide the AWS credentials to use as follows:
source:
  type: parquet
  path: s3a://<my-bucket/my-key.parquet>
  credentials:
    accessKey: <access-key>
    secretKey: <secret-key>
Where <access-key> and <secret-key> should be replaced with your actual AWS access key and secret key.
The Migrator also supports advanced AWS authentication options such as using AssumeRole. Please read the configuration reference for more details.
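For illustration only, such a configuration could look roughly like the following sketch; the assumeRole block and its arn key are assumptions made for this example and are not documented on this page, so refer to the configuration reference for the authoritative property names:
source:
  type: parquet
  path: s3a://<my-bucket/my-key.parquet>
  credentials:
    # Hypothetical keys, shown only to illustrate the idea; see the configuration reference
    assumeRole:
      arn: <role-arn>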
The migration target can be Apache Cassandra or ScyllaDB. In both cases, use the type cassandra in the configuration. Here is a minimal target configuration to write to Apache Cassandra or ScyllaDB:
target:
  # Can be 'cassandra' or 'scylla', it does not matter
  type: cassandra
  # Host name of one of the nodes of your target database cluster
  host: <scylla-server-01>
  # TCP port for CQL
  port: 9042
  # Keyspace to use
  keyspace: <keyspace>
  # Name of the table to write. If it does not exist, it will be created on the fly.
  # It has to have the same schema as the source table. If needed, you can rename
  # columns along the way; see the documentation page “Rename Columns”.
  table: <table>
  # Consistency level for the target connection.
  # Options are: LOCAL_ONE, ONE, LOCAL_QUORUM, QUORUM.
  consistencyLevel: LOCAL_QUORUM
  # Number of connections to use to ScyllaDB / Apache Cassandra when copying
  connections: 16
  # Spark pads decimals with zeros appropriate to their scale. This causes values
  # like '3.5' to be copied as '3.5000000000...' to the target. There's no good way
  # currently to preserve the original value, so this flag can strip trailing zeros
  # on decimal values before they are written.
  stripTrailingZerosForDecimals: false
Where <scylla-server-01>, <keyspace>, and <table> should be replaced with your specific values.
Additionally, you can set the following optional properties:
target:
  # ... same as above
  # Datacenter to use
  localDC: <datacenter>
  # Authentication credentials
  credentials:
    username: <username>
    password: <pass>
  # SSL options as per https://github.com/scylladb/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-ssl-connection-options
  sslOptions:
    clientAuthEnabled: false
    enabled: false
    # All of the following are optional (generally only trustStorePassword and trustStorePath are needed)
    trustStorePassword: <pass>
    trustStorePath: <path>
    trustStoreType: JKS
    keyStorePassword: <pass>
    keyStorePath: <path>
    keyStoreType: JKS
    enabledAlgorithms:
      - TLS_RSA_WITH_AES_128_CBC_SHA
      - TLS_RSA_WITH_AES_256_CBC_SHA
    protocol: TLS
  # If timestamps are not preserved (preserveTimestamps is false in the source),
  # the writer can enforce a single TTL or write timestamp for ALL written records.
  # Such a write timestamp can, for example, be set to a time BEFORE starting dual writes,
  # which makes your migration safe from overwriting dual writes, even for collections.
  # ALL rows written will get the same TTL, the same write timestamp, or both
  # (you can uncomment just one of them, both, or none).
  # TTL in seconds (the sample value 7776000 is 90 days)
  writeTTLInS: 7776000
  # Write timestamp in microseconds (the sample value 1640998861000000 is Saturday, January 1, 2022 2:01:01 AM GMT+01:00)
  writeWritetimestampInuS: 1640998861000000
Where <datacenter>, <username>, <pass>, and <path> should be replaced with your specific values.
If you use the option trustStorePath or keyStorePath, use the --files option in the spark-submit invocation to let Spark copy the file to the worker nodes.
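For example, a sketch of such an invocation; the master URL, file paths, and JAR name below are placeholders to adapt to your own deployment:
spark-submit --class com.scylladb.migrator.Migrator \
  --master spark://<spark-master>:7077 \
  --conf spark.scylla.config=config.yaml \
  --files /path/to/truststore.jks \
  scylla-migrator-assembly.jar
With --files, Spark places a copy of the listed file in the working directory of each executor, so the trustStorePath (or keyStorePath) in config.yaml can then reference that distributed copy.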