Moving Repositories Between Project Hosting Platforms

No matter how tightly developers are committed to their current project hosting provider (GitHub, GitLab, GNU Savannah, or whatever), new ones will come along over time. The history of web services is replete with turnover, and project hosting forges all follow the inevitable trend. But the cost of migration is formidable: It’s quite easy to set up a new project host like GitLab, but how do you move the whole structure of your team’s code, branches, comments, issues, and merge requests into its new home?

Software Heritage, a non-profit with the mission of archiving free software code, faced this daunting challenge when they decided to move from Phabricator to the more vibrant GitLab. For a while, a lot of free and open source projects had found Phabricator appealing, but the forge had been gradually declining and officially ceased development in 2021.

At OTS, we developed an open source tool and framework to support migrating to a new project hosting platform. We used it to move all of Software Heritage’s projects from Phabricator to GitLab, but the framework is robust enough to support migrations between almost any project hosts.

The tool is called Forgerie. Its goal is to automate the migration of projects from one hosting system to another. Forgerie is extendable to any source and destination. It translates input from a project hosting platform into a richly-featured internal format, then exports from that format to the destination platform.

This is the same method used by many tools that perform n-to-n migrations. For instance, the health care field contains many incompatible electronic record systems, so migration tools usually create an intermediate format to cut down on the number of necessary format conversions.
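The saving from an intermediate format is easy to quantify. The following is a rough sketch (the function names are illustrative and not part of Forgerie): converting directly between n platforms needs one converter per ordered pair, whereas a shared intermediate format needs only one exporter and one importer per platform.

```python
def direct_converters(n: int) -> int:
    """One converter for every ordered (source, destination) pair."""
    return n * (n - 1)


def hub_converters(n: int) -> int:
    """One exporter into, and one importer out of, the intermediate format per platform."""
    return 2 * n


# Compare the two approaches as the number of platforms grows.
for n in (2, 3, 5, 10):
    print(n, direct_converters(n), hub_converters(n))
```

The two approaches break even at three platforms; beyond that, the intermediate format grows linearly while direct conversion grows quadratically.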

OTS continues to work on Forgerie as part of its offering of migration services to clients. If you would like to use Forgerie, please grab it from Forgerie’s GitLab page or contact us if you would like help with a migration.

The rest of this post offers some technical background on Forgerie. It should be of interest to anybody solving similar project hosting problems or, more generally, to anybody working on moving structured data into a new data store. Many migration projects fall into the traditional category of Extract, Transform, Load (ETL), but the richness of data stores today stretches the category into new realms.

Forgerie

The Forgerie code was initiated by OTS developer Frank Duncan and released under the GNU Affero General Public License v3.0. This post delves into the project goals along with suggestions for the future of this project. We’ll look at the difficulties posed by this major migration project and how we handled them. This story may offer lessons and tips to people dealing with all kinds of data migration.

If you have used a project hosting system, you might well be imagining the massive requirements for even such a limited project. Code in a forge exists in many branches, each created by multiple commits and enhanced by merges. Numerous issues (change requests) have been posted by different users, along with comments that refer to the issues by number. Commit messages also link and refer to the numbers of issues and branches.

The need for a general project hosting migration tool

Tools for importing projects exist for various project hosting platforms, but they are limited. GitLab does a pretty good job importing a repository from GitHub, and GitHub from GitLab, and both allow the uploading of a private repository. Later in this article we’ll examine one particular limitation of all these import tools: handling multiple contributors.

To automate the migration from Phabricator to GitLab, Software Heritage contracted with Open Tech Strategies (OTS), a free and open source software consulting firm. Preliminary research turned up a few tools claiming to perform the migration, but none of them did a complete job. And each migration tool works only with one particular forge as input and another as destination. OTS decided to design its new tool as a general converter that could be adapted to any source and target repositories.

Migration thus requires the automated tool to reproduce, on the target forge, all the projects, branches, commits, merge requests, merges, issues, comments, and users recorded in the source repository. If possible, contributors should be associated with their contributions.

OTS chose to create Forgerie in Common Lisp, which seems like an odd choice in the 2020s. But Common Lisp is well-maintained and robust. Its big advantage for the Forgerie project was that Lisp makes database-to-dictionary conversions easy. Because Phabricator stores data in a relational database, database-to-dictionary conversions were the central task in automating the migration.
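For readers who do not use Lisp, here is a minimal Python sketch of what a database-to-dictionary conversion looks like. It is not Forgerie's code: it uses an in-memory SQLite database as a stand-in for Phabricator's MySQL database, and the `task` table and its columns are invented for illustration.

```python
import sqlite3


def rows_as_dicts(conn, query, params=()):
    """Run a query and return each row as a {column_name: value} dictionary."""
    cursor = conn.execute(query, params)
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Hypothetical schema and data, standing in for the source forge's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO task (id, title, author) VALUES (?, ?, ?)",
    [(1, "Fix login", "alice"), (2, "Update docs", "bob")],
)

tasks = rows_as_dicts(conn, "SELECT id, title, author FROM task ORDER BY id")
print(tasks)
```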

The Forgerie project has three subdirectories: a set of core files used by all migrations, egress files for Phabricator, and ingress files for GitLab. This design leaves room for future developers to extend the project by adding more ingress and egress options. To go from Phabricator to GitHub, for instance, a maintainer can reuse the existing core and Phabricator directories and needs to write only the GitHub-specific code.
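The division of labor can be sketched in a few lines of Python. This is an illustration of the pattern only, not Forgerie's actual interfaces (Forgerie is written in Common Lisp); the `Ticket`, `Egress`, and `Ingress` names and the two fake platforms are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Iterable, Protocol


@dataclass
class Ticket:
    """Platform-neutral record shared by every source and destination."""
    number: int
    title: str
    author_email: str


class Egress(Protocol):
    """Platform-specific code that reads items out of a source forge."""
    def tickets(self) -> Iterable[Ticket]: ...


class Ingress(Protocol):
    """Platform-specific code that writes items into a destination forge."""
    def create_ticket(self, ticket: Ticket) -> None: ...


def migrate(source: Egress, destination: Ingress) -> int:
    """Core logic: knows nothing about any particular platform."""
    count = 0
    for ticket in source.tickets():
        destination.create_ticket(ticket)
        count += 1
    return count


class FakePhabricator:
    def tickets(self):
        yield Ticket(1, "Fix login", "alice@example.org")
        yield Ticket(2, "Update docs", "bob@example.org")


@dataclass
class FakeGitLab:
    created: list = field(default_factory=list)

    def create_ticket(self, ticket):
        self.created.append(ticket)


gitlab = FakeGitLab()
migrated = migrate(FakePhabricator(), gitlab)
print(migrated, [t.number for t in gitlab.created])
```

Supporting a new destination in this sketch means writing one more class with a `create_ticket` method; `migrate` and the source class stay untouched.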

Impedance mismatches create challenges

All forges offer basic version control features, along with communication and management tools such as issues. But each forge is also unique. In this case, Duncan had to decide how best to accommodate features that differ or are missing in the target GitLab platform.

The biggest challenge Duncan faced is that GitLab maps projects to repositories on a one-to-one basis, whereas Phabricator treats a project as a higher-order concept. A project in Phabricator can contain multiple repositories, and a repository can be part of many projects. Phabricator also supports multiple version control tools (Git, Mercurial, etc.). Making Forgerie flexible enough to smooth over these types of differences in data structure was a key goal.

The different approaches to projects introduced several complications. First, Duncan had to make sure that each message and ticket pointed to the right GitLab project.
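A small Python sketch shows why this is not automatic. The project and repository names are invented, and the sketch does not reproduce whatever rule Forgerie applies; it only shows that a ticket tagged with one Phabricator project can correspond to several GitLab projects once each repository becomes its own project.

```python
from collections import defaultdict

# Hypothetical (phabricator_project, repository) pairs: a many-to-many relation.
memberships = [
    ("archive", "loader"),
    ("archive", "storage"),
    ("web", "storage"),
]


def repos_by_project(pairs):
    """Index repositories by the Phabricator project that contains them."""
    index = defaultdict(set)
    for project, repo in pairs:
        index[project].add(repo)
    return index


def gitlab_targets(ticket_projects, index):
    """GitLab projects (one per repository) that a ticket could belong to."""
    targets = set()
    for project in ticket_projects:
        targets |= index.get(project, set())
    return sorted(targets)


index = repos_by_project(memberships)
candidates = gitlab_targets(["archive"], index)
print(candidates)  # more than one candidate means a migration rule is needed
```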

Merge requests were the hardest elements to migrate, because in Phabricator a changeset can span multiple repositories. The requirement that Duncan had to implement was to preserve the sequence of events in the original forge strictly, so that issue 43 in the old forge remains issue 43 in the new forge. That way, any email message or comment referring to the issue still refers to the right one.
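One way to think about number preservation is as a creation plan. The sketch below assumes a destination that numbers new items sequentially starting at 1, and it fills any gaps in the source numbering with placeholders so the counter stays in step. Both the assumption and the placeholder strategy are illustrative; they are not taken from Forgerie.

```python
def creation_plan(source_numbers):
    """Return (number, kind) pairs in the order items must be created.

    Numbers missing from the source (for example, deleted items) become
    placeholders so that later items still receive their original numbers.
    """
    wanted = set(source_numbers)
    if not wanted:
        return []
    plan = []
    for number in range(1, max(wanted) + 1):
        kind = "real" if number in wanted else "placeholder"
        plan.append((number, kind))
    return plan


plan = creation_plan([1, 2, 4])
print(plan)
```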

Lots of details had to be tidied up. For instance, Phabricator has its own markup language to add rich text to comments and issues. This language had to be converted to Markdown to store the comments in GitLab.
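To give a flavor of the conversion, here is a minimal Python sketch that handles a small subset of Phabricator's Remarkup: `= Header =` lines, `//italic//`, `##monospace##`, and `[[ url | label ]]` links. It is an illustration rather than Forgerie's converter, and a real migration has to cover many more constructs and edge cases.

```python
import re


def remarkup_to_markdown(text: str) -> str:
    """Convert a small subset of Remarkup to Markdown."""
    # [[ url | label ]] links become [label](url).
    text = re.sub(r"\[\[\s*([^|\]]+?)\s*\|\s*([^\]]+?)\s*\]\]", r"[\2](\1)", text)
    # ##mono## becomes `mono`.
    text = re.sub(r"##(.+?)##", r"`\1`", text)
    # //italic// becomes *italic*; the lookbehind skips the "//" in "http://".
    text = re.sub(r"(?<!:)//(.+?)//", r"*\1*", text)

    lines = []
    for line in text.splitlines():
        # "= Title =" and "== Title ==" become "# Title" and "## Title".
        header = re.fullmatch(r"(=+)\s*(.+?)\s*=*", line)
        if header:
            line = "#" * len(header.group(1)) + " " + header.group(2)
        lines.append(line)
    return "\n".join(lines)


sample = (
    "= Release notes =\n"
    "Run ##make test## before merging; see //the guide// at "
    "[[ http://example.org/guide | docs ]]."
)
converted = remarkup_to_markdown(sample)
print(converted)
```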

The question of multiple contributors

When there are many people to credit for their contributions, the import tool has a tough nut to crack. Awarding credit properly is crucial because many contributors rest their reputations on the record provided by their contributions. Statistics about the number of commits they made, the “stars” they got, etc. undergird their strategies for employment and promotions. Losing that information would also hurt the project by making it hard to trace changes back to the responsible person.

On the other hand, security concerns preclude allowing someone to import material and attribute it to somebody else.

GitLab solves this problem if the input repository is set up right: The person doing the import needs master or admin access and has to map contributors from the input repository to the destination repository. If access rights don’t allow the import to add material to a contributor’s repository, GitLab’s import can accurately attribute issues to the contributor, but not commits.

Forgerie goes farther in preserving the provenance of contributors: It keeps track of Phabricator users and creates a user in GitLab for each user recorded in the Phabricator repository. The Software Heritage project did not present difficulties because no contributor had an account in GitLab. To be precise, the email address that identified each Phabricator contributor didn’t already exist for any GitLab contributor. If GitLab had an account with the same email address as an account being imported, the system would have issued an error and prevented Forgerie from importing the contributor’s commits.
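A migration team could check for such collisions before starting a run. The following Python sketch is a hypothetical pre-flight check, not a Forgerie feature; the field names are invented, and comparing addresses case-insensitively is an assumption.

```python
def find_email_conflicts(source_users, existing_emails):
    """Return source email addresses that already belong to a destination account."""
    existing = {email.strip().lower() for email in existing_emails}
    return sorted(
        user["email"]
        for user in source_users
        if user["email"].strip().lower() in existing
    )


# Hypothetical users exported from the source forge.
source_users = [
    {"name": "alice", "email": "alice@example.org"},
    {"name": "bob", "email": "Bob@example.org"},
]
# Hypothetical addresses already registered on the destination forge.
existing_emails = ["bob@example.org", "carol@example.org"]

conflicts = find_email_conflicts(source_users, existing_emails)
print(conflicts)
```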

A few implementation details

Forgerie carries out a migration by creating a log of everything that happened in the source repository, and replaying the log in the target forge. Phabricator uses a classic LAMP stack, storing all repository information in a MySQL database. Forgerie queries this database to retrieve each item in order, then invokes the GitLab API to create the item there.
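The log-and-replay idea can be sketched as follows. This Python example is illustrative only: the item fields are invented, the lists stand in for database query results, and `create_item` is a placeholder for a call to the destination forge's API.

```python
from operator import itemgetter

# Hypothetical query results from separate tables in the source database.
tickets = [
    {"kind": "ticket", "created": 100, "title": "Fix login"},
    {"kind": "ticket", "created": 300, "title": "Update docs"},
]
comments = [
    {"kind": "comment", "created": 200, "body": "I can reproduce this."},
]


def build_log(*streams):
    """Merge items of every kind into one log ordered by creation time."""
    items = [item for stream in streams for item in stream]
    return sorted(items, key=itemgetter("created"))


def replay(log, create_item):
    """Recreate each item on the destination in its original order."""
    for item in log:
        create_item(item)


replayed = []
replay(build_log(tickets, comments), replayed.append)
print([item["kind"] for item in replayed])
```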

The GitLab API is relatively slow for those particular types of request, requiring one or two seconds for each request, and repositories can contain tens of thousands of items when you count all the merges, comments, etc. So you can expect a migration to take 24 hours or more.
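The arithmetic behind that estimate is simple. The figures below are example values chosen from the ranges just mentioned, not measurements from the Software Heritage migration.

```python
def estimated_hours(items: int, seconds_per_request: float) -> float:
    """Time for a strictly sequential run making one API request per item."""
    return items * seconds_per_request / 3600


# Example: 60,000 items at 1.5 seconds per request.
hours = estimated_hours(60_000, 1.5)
print(hours)
```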

Long runs call for checkpoints and restarts. When Duncan designed the first, simple version of Forgerie, which he expected to run just once, he figured he could simply restart the run if it failed. Later he realized that starting over after a failure 23 hours into a run was unacceptable.

The log solves this problem through a kind of simple transaction. You can conceive of the migration as moving through three stages (Figure 1). In the first stage, items are in the old platform but not the log. In the second stage, Forgerie adds the items to the log. In the third stage, items are safely loaded into the destination platform and can be removed from the log. Should the job fail, the user can restart it from the beginning of the log.

Figure 1: Logging items as they move from source to destination platform.
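The three stages can be sketched in Python. This is a simplified illustration, not Forgerie's implementation: it assumes the log is a JSON file that is rewritten after every item, and it simulates one API failure to show the restart.

```python
import json
import os
import tempfile


def save_log(path, items):
    with open(path, "w") as handle:
        json.dump(items, handle)


def load_log(path):
    with open(path) as handle:
        return json.load(handle)


def drain(path, load_item):
    """Load each logged item into the destination, removing it from the
    log only after the load has succeeded."""
    pending = load_log(path)
    while pending:
        load_item(pending[0])  # may raise; the item then stays in the log
        pending = pending[1:]
        save_log(path, pending)


destination = {}
failures = {"remaining": 1}


def flaky_load(item):
    """Stand-in for an API call that fails once on item 3."""
    if item["id"] == 3 and failures["remaining"]:
        failures["remaining"] -= 1
        raise ConnectionError("simulated API outage")
    destination[item["id"]] = item["title"]


log_path = os.path.join(tempfile.mkdtemp(), "migration-log.json")
save_log(log_path, [{"id": i, "title": f"item {i}"} for i in range(1, 5)])

try:
    drain(log_path, flaky_load)
except ConnectionError:
    pass  # the first run dies partway through

remaining_after_crash = [item["id"] for item in load_log(log_path)]
drain(log_path, flaky_load)  # the restart resumes at the head of the log
print(remaining_after_crash, sorted(destination))
```

Rewriting the whole file after each item keeps the sketch short; a production tool would more likely use a database table or an append-only file.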

A classic issue with transactions arises with a log: Suppose an item has just entered the target forge but Forgerie did not have a chance to remove the item from the log before a failure. The item exists in both the target repository and the log, so when Forgerie starts up again, the item will be added a second time to the repository. Forgerie developers do not have to worry about this happening because the insertions are idempotent. The second insertion overwrites the first with no corruption of information.
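The difference between an idempotent insertion and a naive one fits in a few lines of Python. The `source_id` key is an assumption made for this sketch; the point is only that an insertion keyed on a stable identifier can be repeated safely.

```python
def naive_insert(items, item):
    """Appending creates a duplicate every time an item is replayed."""
    items.append(item)


def idempotent_insert(store, item):
    """Keying on a stable identifier makes a repeated insert overwrite itself."""
    store[item["source_id"]] = item


item = {"source_id": 43, "title": "Crash on startup"}

duplicated = []
naive_insert(duplicated, item)
naive_insert(duplicated, item)  # replay after a failure

store = {}
idempotent_insert(store, item)
idempotent_insert(store, dict(item))  # replay after a failure

print(len(duplicated), len(store))
```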

Assessing the Forgerie project

The Forgerie code base is surprisingly small, a total of 2,726 lines, divided as follows:

     • Core (shared) code: 350 lines

     • Phabricator-specific code: 1,233 lines

     • GitLab-specific code: 1,143 lines

No platform lives forever. Amazing as the capabilities of GitHub and GitLab are (and they continue to evolve), there will come a time when developers decide they have to pick up and move their code to some glorious new way of working. Forgerie tries to make migration as painless as possible.

Thanks to Andy Oram for assistance drafting this post, to Jim McGowan for making the diagrams, and to Antoine R. Dumont of Software Heritage for contributing technical improvements to the Forgerie project.

Need help migrating off Phabricator?

Open Tech Strategies can help you migrate off of Phabricator now that it has reached end-of-life. We developed Forgerie, an open source tool that aids in migration between code forges. Forgerie extracts data from your Phabricator instance and injects it into a GitLab instance. It can also help move repositories to a GitHub account.

Using Forgerie is a process. It requires setting parameters, running the tool, examining the results, tweaking the parameters, and re-running until the result meets your needs. Our team can help you with this process. We can move you to your own GitLab instance, host an instance for you, or get you migrated to GitLab.com or GitHub.

We’d love to help you transition from Phabricator. Drop us a line at info AT opentechstrategies.com and we’ll get you safely to your new home.

Keep Your Friends Close

Picture of Mount Rushmore, by Dean Frankling, CC-By-SA
Mount Rushmore

(This is the sixth post in our Open Source At Large series.)

One of the insider secrets of free and open source software (FOSS) is that most of the rules a project uses on a day-to-day basis are not found in the software’s license. There are contribution guidelines, which are enforced by the project only taking contributions that meet them. There are codes of conduct, which are a condition of community participation. There are endorsements, official membership, a voice in setting the project roadmap, and all kinds of other benefits that attach to varying types of community participation. In each case, entirely external to the license, there are official rules and unwritten norms that govern how participants gain the benefits of joining the civic life of a project.

If you were to make an ecosystem map of an open source project, you might place the project in the middle of the page and then depict scale of involvement as distance from that center. The closer to the center a participant sits, the more influence the project has on them; the further from the center, the less sway the project has.

At the center is the project itself: its core developers and the people who have made commitments that affect the project’s outputs and actions. A project has a lot of visibility into how these participants act because tight, highly-connected cooperation is beneficial for everyone, and so participants are motivated to act in ways that avoid damaging that cooperation. This mechanism is so natural that most projects do not often think of it as something they could expand intentionally. But sometimes projects do exactly that: they figure out ways to deliberately widen their sphere of influence.

For example, Joomla maintains a directory of third-party extensions. It is the way most users discover Joomla extensions. For many businesses based on providing Joomla extensions, absence from that directory is akin to not existing at all. When the Joomla project decided to tighten license compliance among its extension developer community, they didn’t ask their lawyer to run around issuing threats. They simply explained that any project that wanted to appear in that directory must abide by community rules (https://docs.joomla.org/Extensions_and_GPL). Extension developers came into line.

A similar example can be seen in the Guidelines for Commercial Entities at the Arches Project. A glance over the guidelines will show the kinds of real-world problems they were developed to address. Only those who agree to the guidelines are listed in the official directory of Arches service providers.

Of course, being in some kind of project-endorsed directory is just one type of gateway. Another is participation in the project at all, that is, the ability to take part in project discussions, to vote (when there are decisions made by vote), and to have one’s contributions evaluated and accepted by the project with full attribution. Getting contributions accepted into the core project on a regular basis is important for those whose businesses depend on the project. If they can’t get their bugfixes and new features accepted upstream, then they may be forced to maintain their own divergent version (the term of art is “vendor branch”) indefinitely — a situation whose technical and organizational costs only get worse over time.

The right techniques will differ from project to project, because they must be based on the particular project’s history (as in the examples above). But the general reason these techniques work is that the non-code parts of a project are valuable in their own right. Those parts are not covered by the code’s license, but rather by the project’s norms and rules. Crucially, these parts cannot be replicated: unlike the code, you can’t make a copy of a community, or of a developer’s attention, or of an endorsement’s value. Equally crucially, none of them can be demanded by bad actors. The benefits of participation flow naturally to community members in good standing and it is equally natural to deny them to people and firms that refuse to align themselves with the community ethos. Creating structures that allow projects to control access to community benefits is a powerful way to enforce norms.

Using community participation as the mechanism for promulgating norms has its limits. Some participants stay far enough from the center of the project that they are effectively immune to community inducements. (Fortunately, projects have other mechanisms available to influence them, and we will cover some of those in a future post.) But in most cases, organizations that have a core reliance on the code will find multiple reasons to stay in good standing with the community, and this means the project has a chance to influence how those organizations behave. Spotting these leverage points takes experience as well as an understanding of project goals and positioning. Projects that want to wield influence over their ecosystem — whether for strategic or ethical ends — should actively look for ways to provide value backed by network effects, until the case for participation is overwhelming.

Thanks to Microsoft for sponsoring the Open Source At Large blog series.

Be Open From Day One, Not Day N.

Note: This is an updated version of an article I first wrote in 2011. The original site went offline for a while, and although it was later restored, thanks to heroic efforts by Philip Ashlock, I felt the article needed a new home, and wanted a chance to update it anyway. This version also incorporates some suggestions from V. David Zvenyach.

Over the years we’ve watched software projects of all sizes make the transition from closed-source to open source. The lesson we consistently draw from them is this:

If you’re running a software project and you plan to make it open source eventually, then just make it open source from the beginning of development.

Waiting will only create more work.

The longer a project is run in a closed-source mode, the harder it will be to open source later. Continue reading “Be Open From Day One, Not Day N.”

Field Guide To Open Source Project Archetypes

Open Source Archetypes report cover.

Open source is a broad term that encompasses many different types of projects. There is a wide range of open source approaches, and sometimes it helps to think through how your open source approach matches your goals, resources, and environment. In many places we look, we see open source used as a catch-all term to refer to every project. We don’t have a common vocabulary to discuss open source in ways that take account of important differences.

OTS prepared a field guide to open source project archetypes with Mozilla that is a first step in addressing that problem. The report catalogs a number of open source archetypes we observe around the community. OTS and Mozilla have found these archetypes to be a useful resource when crafting strategy, weighing tradeoffs, and committing support to open source endeavors. Today, we share the results of this work with the community. Continue reading “Field Guide To Open Source Project Archetypes”

Decentralization: Worth The Wait

Ethan Zuckerman has a piece in Wired that says building decentralization tools is a sucker’s bet. He and his coauthors, Chelsea Barabas and Neha Narula, mention the FreedomBox, which I helped lead, as an example of how difficult this stuff is. They point to a list of things that make decentralized efforts prone to failure and conclude:

Our research—a combination of technical and historical analysis, and dozens of interviews with open web advocates—indicates that there is no straightforward technical solution to the problem of platform monopolies. Moreover, it’s not clear we can solve the nuanced issues of centralization by pushing for “re-decentralization” of publishing online. The reality is that most people do not want to run their own web servers or social network nodes. They want to engage with the web through friendlier platforms, and these platforms will be constrained by the same forces that drive consolidation today.

They point to a “better strategy” of policies aimed at “data portability, interoperability, and alternatives to advertising-based funding models”.

I’m no longer with FreedomBox, and none of what they write is wrong (I was one of the open web advocates they interviewed), but I wanted to chime in because there’s more to decentralization than seeking a “straightforward technical solution” and building a better social networking app. It’s true that we haven’t realized the grand vision of Diaspora and FreedomBox. They’re right that we need enlightened policy. We need the centralized platform monopolies to behave better. Those steps, though, won’t ever give people control over the means of communication. Without that control, we’ll always be at the mercy of Facebook or whatever comes next. Continue reading “Decentralization: Worth The Wait”

Report on GeoNode’s path to open source success

A map of flood zones in Haiti, rendered with GeoNode. Vulnerable areas are highlighted in red, over a base map showing a street map of Port-au-Prince. The mouse is hovering over a menu item which explains that "This map layer modelizes areas of frequent flooding for Port-au-Prince region."
Haiti Flood Zone map. Source: http://haitidata.org/maps/153/view.

Recently, OTS was asked to write a report about the GeoNode project by one of its primary sponsors, the Global Facility for Disaster Reduction and Recovery (GFDRR), a global partnership that is managed by the World Bank. GeoNode is a facility for sharing and displaying geographical information. It is “web-based, open source software that enables organizations to easily create catalogs of geospatial data, and that allows users to access, share, and visualize that data,” as we put it in our report. Continue reading “Report on GeoNode’s path to open source success”

Attending the Wontfix Cabal

Pile of stickers that read "wontfix_" in green monotype font on a black background.
Courtesy of Jess Frazelle.

GitHub hosted the “Wontfix Cabal” last week in San Francisco, and I was lucky enough to attend, thanks to a pointer from a friend. The organizers, led by Jess Frazelle, conceived the gathering as a chance for people maintaining open source projects to discuss their particular difficulties and some strategies for addressing them. About 100 maintainers took them up on it and convened in San Francisco. At the beginning of the day, we compiled dozens of sticky notes’ worth of problems that come up in our various projects. These ranged from tips for maintainers who had received the classic question “Is this project still maintained?” to “Ethics and Exploitation in Open Source.” In a common unconference pattern, attendees voted on those topics. The most popular topics were chosen to become discussion groups, and we split up for a morning session and an afternoon session. Continue reading “Attending the Wontfix Cabal”

Sharing data across Red Cross projects: the Smoke Alarm Portal and allReady


OTS has been lucky enough to work with the Red Cross of Chicago and Northern Illinois (CNI) for the past year and a half, thanks in large part to the civic data community at Chi Hack Night.  With Jim McGowan, CNI’s Director of Planning and Situational Awareness, we developed the open source Smoke Alarm Request Portal, where you can request to have a volunteer come install a smoke alarm in your home for free.

Now we’re connecting the Smoke Alarm Request Portal to allReady, an open source platform for volunteer preparedness and coordination (part of the Humanitarian Toolbox suite of disaster preparedness and prevention tools).  With help from both open source communities, the two applications will share data to simplify the process of scheduling smoke alarm installations. Continue reading “Sharing data across Red Cross projects: the Smoke Alarm Portal and allReady”

Open Source Code of Conduct for Commercial Entities (DRAFT)

Note: This is a draft of a Code of Conduct meant to help a specific open source project give guidance to its commercial participants. The project is already in production use, and is successful enough that some commercial entities have become involved, offering support, hosting services, customization, etc. However, those companies need some guidelines about how to conduct themselves, in relation to the project as a whole and in relation to each other.

The first half of the draft is aimed at commercial entities. The second half of the draft contains guidelines for the open source project itself — healthy commercial participation being a two-way street.

Once the text is finalized, we plan to post this in generic form, as a template that other projects can use, while delivering a customized version to that project.

In the meantime, comments welcome! You can simply leave regular blog comments, but we’ve also enabled the open source Hypothes.is WordPress annotation plugin to allow sidebar annotations of selected text. To use it, just mouse-select any passage of text, as though you were going to copy it to the clipboard, and then wait for the Hypothes.is annotation action buttons to pop up right under your selected text. There should be two of them: “Annotate” and “Highlight”. Choose “Annotate”, and then, at the top of the right-hand sidebar that should now open up, create a free account at Hypothes.is, or sign in if you already have an account. (We need the authentication step to help prevent spam annotations.) Once you are signed in, you can leave a sidebar comment associated with a specific passage of text; the user interface should be pretty clear from this point on. Please note that your comments will be publicly visible by default; you can also make an annotation that’s private to yourself, but then we wouldn’t be able to see it either, of course.

Continue reading “Open Source Code of Conduct for Commercial Entities (DRAFT)”