Skip to main content

Migrating Microsoft Excel and LibreOffice Calc files into Drupal

Today we will learn how to migrate content from Microsoft Excel and LibreOffice Calc files into Drupal using the Migrate Spreadsheet module. We will give instructions on getting the module and its dependencies. Then, we will present how to configure the module for spreadsheets with or without a header row. There are two example migrations: images and paragraphs. Let’s get started.

Example configuration for Microsoft Excel and LibreOffice Calc migration.

Getting the code

You can get the full code example at https://github.com/dinarcon/ud_migrations.

The module to enable, as in yesterday's post in which we imported Google Sheets, is UD Google Sheets, Microsoft Excel, and LibreOffice Calc source migration whose machine name is ud_migrations_sheets_sources. It comes with four migrations: udm_google_sheets_source_node.yml, udm_libreoffice_calc_source_paragraph.yml, udm_microsoft_excel_source_image.yml, and udm_backup_csv_source_node.yml. The image migration uses a Microsoft Excel file as source. The paragraph migration uses a LibreOffice Calc file as source. The CSV migration is a backup in case the Google Sheet is not available. To execute the last one you would need the Migrate Source CSV module.

You can get the Migrate Spreadsheets module using composer: composer require drupal/migrate_spreadsheet:^1.0. This module depends on the PHPOffice/PhpSpreadsheet library and many PHP extensions including ext-zip. Check this page for a full list of dependencies. If any required extension is missing the installation will fail. If your Drupal site is not composer-based, you will not be able to use Migrate Spreadsheet, unless you jump through a lot of hoops.

Understanding the example set up

This migration will reuse the same configuration from the introduction to paragraph migrations example. Refer to that article for details on the configuration. The destinations will be the same content type, paragraph type, and fields. The source will be changed in today's example, as we use it to explain Microsoft Excel and LibreOffice Calc migrations. The end result will again be nodes containing an image and a paragraph with information about someone’s favorite book. The major difference is that we are going to read from different sources.

Note: You can literally swap migration sources without changing any other part of the migration.  This is a powerful feature of ETL frameworks like Drupal’s Migrate API. Although possible, the example includes slight changes to demonstrate various plugin configuration options. Also, some machine names had to be changed to avoid conflicts with other examples in the demo repository.

Understanding the source document and plugin configuration

In any migration project, understanding the source is very important. For Microsoft Excel and LibreOffice Calc migrations, the primary thing to consider is whether or not the file contains a row of headers. Also, a workbook (file) might contain several worksheets (tabs). You can only migrate from one worksheet at a time. The example documents have two worksheets: UD Example Sheet and Do not peek in here. We are going to be working with the first one.

The spreadsheet source plugin exposes seven configuration options. The values to use might change depending on the presence of a header row, but all of them apply for both types of document. Here is a summary of the available configurations:

  • file is required. It stores the path to the document to process. You can use a relative path from the Drupal root, an absolute path, or stream wrappers.
  • worksheet is required. It contains the name of the one worksheet to process.
  • header_row is optional. This number indicates which row containing the headers. Contrary to CSV migrations, the row number is not zero-based. So, set this value to 1 if headers are on the first row, 2 if they are on the second, and so on.
  • origin is optional and defaults to A2. It indicates which non-header cell contains the first value you want to import. It assumes a grid layout and you only need to indicate the position of the top-left cell value.
  • columns is optional. It is the list of columns you want to make available for the migration. In case of files with a header row, use those header values in this list. Otherwise, use the default title for columns: A, B, C, etc. If this setting is missing, the plugin will return all columns. This is not ideal, especially for very large files containing more columns than needed for the migration.
  • row_index_column is optional. This is a special column that contains the row number for each record. This can be used as unique identifier for the records in case your dataset does not provide a suitable value. Exposing this special column in the migration is up to you. If so, you can come up with any name as long as it does not conflict with header row names set in the columns configuration. Important: this is an autogenerated column, not any of the columns that come with your dataset.
  • keys is optional and, if not set, it defaults to the value of row_index_column. It contains an array of column names that uniquely identify each record. For files with a header row, you can use the values set in the columns configuration. Otherwise, use default column titles like A, B, C, etc. In both cases, you can use the row_index_column column if it was set. Each value in the array will contain database storage details for the column.

Note that nowhere in the plugin configuration you specify the file type. The same setup applies for both Microsoft Excel and LibreOffice Calc files. The library will take care of detecting and validating the proper type.

Migrating spreadsheet files with a header row

This example is for the paragraph migration and uses a LibreOffice Calc file. The following snippets shows the UD Example Sheet worksheet and the configuration of the source plugin:

book_id, book_title, Book author
B10, The definitive guide to Drupal 7, Benjamin Melançon et al.
B20, Understanding Drupal Views, Carlos Dinarte
B30, Understanding Drupal Migrations, Mauricio Dinarte
source:
  plugin: spreadsheet
  file: modules/custom/ud_migrations/ud_migrations_sheets_sources/sources/udm_book_paragraph.ods
  worksheet: 'UD Example Sheet'
  header_row: 1
  origin: A2
  columns:
    - book_id
    - book_title
    - 'Book author'
  row_index_column: 'Document Row Index'
  keys:
    book_id:
      type: string

The name of the plugin is spreadsheet. Then you use the file configuration to indicate the path to the file. In this case, it is relative to the Drupal root. The UD Example Sheet is set as the worksheet to process. Because the first row of the file contains the header rows, then header_row is set to 1 and origin to A2.

Then specify which columns to make available to the migration. In this case, we listed all of them so this setting could have been left unassigned. It is better to get into the habit of being explicit about what you import. If the file were to change and more columns were added, you would not have to update the file to prevent unneeded data to be fetched. The row_index_column is not actually used in the migration, but it is set to show all the configuration options in the example. The values will be 1, 2, 3, etc.  Finally, the keys is set the column that serves as unique identifiers for the records.

The rest of the migration is almost identical to the CSV example. Small changes were made to prevent machine name conflicts with other examples in the demo repository. For reference, the following snippet shows the process and destination sections for the LibreOffice Calc paragraph migration.

process:
  field_ud_book_paragraph_title: book_title
  field_ud_book_paragraph_author: 'Book author'
destination:
  plugin: 'entity_reference_revisions:paragraph'
  default_bundle: ud_book_paragraph

Migrating spreadsheet files without a header row

Now let’s consider an example of a spreadsheet file that does not have a header row. This example is for the image migration and uses a Microsoft Excel file. The following snippets shows the UD Example Sheet worksheet and the configuration of the source plugin:

P01, https://agaric.coop/sites/default/files/pictures/picture-15-1421176712.jpg
P02, https://agaric.coop/sites/default/files/pictures/picture-3-1421176784.jpg
P03, https://agaric.coop/sites/default/files/pictures/picture-2-1421176752.jpg
source:
  plugin: spreadsheet
  file: modules/custom/ud_migrations/ud_migrations_sheets_sources/sources/udm_book_paragraph.ods
  worksheet: 'UD Example Sheet'
  header_row: 1
  origin: A2
  columns:
    - book_id
    - book_title
    - 'Book author'
  row_index_column: 'Document Row Index'
  keys:
    book_id:
      type: string

The plugin, file, amd worksheet configurations follow the same pattern as the paragraph migration. The difference for files with no header row is reflected in the other parameters. header_row is set to null to indicate the lack of headers and origin is to A1. Because there are no column names to use, you have to use the ones provided by the spreadsheet. In this case, we want to use the first two columns: A and B. Contrary to CSV migrations, the spreadsheet plugin does not allow you to define aliases for unnamed columns. That means that you would have to use A, B in the process section to refer to these columns.

row_index_column is set to null because it will not be used. And finally, in the keys section, we use the A column as the primary key. This might seem like an odd choice. Why use that value if you could use the row_index_column as the unique identifier for each row? If this were an isolated migration, that would be a valid option. But this migration is referenced from the node migration explained in the previous example. The lookup is made based on the values stored in the A column. If we used the index of the row as the unique identifier, we would have to update the other migration or the lookup would fail. In many cases, that is not feasible nor desirable.

Except for the name of the columns, the rest of the migration is almost identical to the CSV example. Small changes were made to prevent machine name conflicts with other examples in the demo repository. For reference, the following snippet shows part of the process and destination section for the Microsoft Excel image migration.

process:
  psf_destination_filename:
    plugin: callback
    callable: basename
    source: B # This is the photo URL column.
destination:
  plugin: 'entity:file'

Refer to this entry to know how to run migrations that depend on others. In this case, you can execute them all by running: drush migrate:import --tag='UD Sheets Source'. And that is how you can use Microsoft Excel and LibreOffice Calc files as the source of your migrations. This example is very interesting because each of the migration uses a different source type. The node migration explained in the previous post uses a Google Sheet. This is a great example of how powerful and flexible the Migrate API is.

What did you learn in today’s blog post? Have you migrated from Microsoft Excel and LibreOffice Calc files before? If so, what challenges have you found? Did you know the source plugin configuration is not dependent on the file type? Share your answers in the comments. Also, I would be grateful if you shared this blog post with others.

Next: Defining Drupal migrations as configuration entities with the Migrate Plus module

This blog post series, cross-posted at UnderstandDrupal.com as well as here on Agaric.coop, is made possible thanks to these generous sponsors. Contact Understand Drupal if your organization would like to support this documentation project, whether it is the migration series or other topics.

Sign up to be notified when Agaric gives an online or in-person migration training:

Comments

2020 July 14
Renaud

Permalink

The code snippet under …

The code snippet under «Migrating spreadsheet files without a header row» shows the wrong code. It should be:

source:
  plugin: spreadsheet
  file: modules/custom/ud_migrations/ud_migrations_sheets_sources/sources/udm_photos.xlsx
  worksheet: 'UD Example Sheet'
  # The file does not have a headers row.
  header_row: null
  origin: A1
  # If no header row is available, you use the spreadsheet's column names: A, B, C, etc.
  # If you do not manually add a list of columns, all columns that contain data in the worksheet would be returned.
  # The same names need to used in the process section.
  columns:
    - A # This column contains the photo ID. Example: 'P01'.
    - B # This column contains the photo URL.
  row_index_column: null
  keys:
    A:
      type: string

 

Add new comment

The content of this field is kept private and will not be shown publicly.

Markdown

The comment language code.
CAPTCHA Please help us focus on people and not spambots by answering this question.