LineLM is a transformer-based model for processing and generating single vector lines. This project implements the LineLM architecture (a BERT encoder and a GPT decoder) for geospatial line data.
LineLM provides two main functionalities:
- Pre-training: Masked language modeling on vector lines using the encoder of LineLM
- Fine-tuning: Sequence-to-sequence learning using the full LineLM encoder-decoder model
The models are designed to process geospatial vector lines represented as sequences of coordinate pairs, making them suitable for line refinement to correct distortions, fill gaps, and restore connectivity.
- Purpose: Self-supervised pre-training on vector lines
- Architecture: BERT-based encoder with separate embeddings for X and Y coordinates
- Training: Masked language modeling with 15% token masking
- Input: Single vector lines
- Output: Predictions for masked X and Y coordinates
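The masking step of the pre-training setup listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in pretrain_bert_large.py: the `MASK_ID` value, the `mask_line` helper, and the choice to mask the X and Y tokens of a point together are all hypothetical.

```python
import random

MASK_ID = -1        # hypothetical mask token id; the real vocabulary may differ
MASK_RATIO = 0.15   # 15% masking rate, as stated in the list above


def mask_line(coords, ratio=MASK_RATIO, seed=None):
    """Mask a fraction of the points of a line given as [(x, y), ...].

    Returns (masked_x, masked_y, positions). Assumption: the X and Y tokens
    of a selected point are masked together.
    """
    rng = random.Random(seed)
    n = len(coords)
    # Mask at least one point so that every non-empty line gives a training signal.
    n_mask = max(1, round(n * ratio)) if n else 0
    positions = sorted(rng.sample(range(n), n_mask))
    chosen = set(positions)
    masked_x = [MASK_ID if i in chosen else x for i, (x, _) in enumerate(coords)]
    masked_y = [MASK_ID if i in chosen else y for i, (_, y) in enumerate(coords)]
    return masked_x, masked_y, positions


line = [(i * 10, i * 5) for i in range(20)]
masked_x, masked_y, positions = mask_line(line, seed=0)
print("masked positions:", positions)
```

The model would then be trained to predict the original X and Y values at the returned positions.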
- Purpose: Sequence-to-sequence learning for vector lines
- Architecture: BERT encoder + GPT decoder
- Training: Teacher forcing with cross-entropy loss
- Input: Source and noisy line sequences
- Output: Clean line sequences
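Teacher forcing with cross-entropy loss, as named in the list above, can be illustrated with a small sketch. The special token ids `BOS` and `EOS` and both helper functions are assumptions made for illustration; the repository's tokenization may differ.

```python
import math

# Hypothetical special token ids, chosen outside the [0, 500] coordinate range.
BOS, EOS = 501, 502


def teacher_forcing_pair(target_tokens):
    """Return (decoder_input, labels) for one clean target sequence.

    The decoder sees the ground-truth sequence shifted right by one step,
    and is trained to predict the next ground-truth token at each position.
    """
    decoder_input = [BOS] + list(target_tokens)
    labels = list(target_tokens) + [EOS]
    return decoder_input, labels


def cross_entropy(logits, label):
    """Cross-entropy for one decoding step: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]


decoder_input, labels = teacher_forcing_pair([12, 40, 77])
print(decoder_input)  # [501, 12, 40, 77]
print(labels)         # [12, 40, 77, 502]
```

Because the decoder input always comes from the clean target rather than from the model's own predictions, early mistakes do not compound during training.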
LineLM/
├── README.md # This file
├── pretrain_bert_large.py # Pre-training script
├── fine_tune_large.py # Fine-tuning script
├── utils.py # Data loading utilities
└── model/
    ├── bert_pretrain.py # MaskedBERT model definition
    ├── bert.py # LineLM model
    ├── dataloader_mlm.py # DataLoader for pre-training
    └── dataloader.py # DataLoader for fine-tuning
This project uses Poetry for dependency management. Install dependencies using:
# Install Poetry if you haven't already
curl -sSL https://install.python-poetry.org | python3 -
# Install project dependencies
poetry install
# Activate the virtual environment
poetry shell
The models expect GeoJSON files with the following structure:
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"geometry": {
"type": "LineString",
"coordinates": [[x1, y1], [x2, y2], ..., [xn, yn]]
}
}
]
}
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"geometry": [
[[x1, y1], [x2, y2], ..., [xn, yn]], # First trajectory
[[x1, y1], [x2, y2], ..., [xm, ym]] # Second trajectory
]
}
]
}
Create a data/ directory and place your GeoJSON files:
mkdir data/
Place your GeoJSON files for pre-training in the data/ directory.
Place your input and ground truth GeoJSON files for fine-tuning in the data/ directory.
You can download the GeoJSON files for both pre-training and fine-tuning from this link.
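Before training, it can help to confirm that a file parses into coordinate sequences. The sketch below reads the LineString layout shown earlier using only the standard library. The `load_lines` helper is hypothetical and is not part of utils.py; it skips any feature whose geometry is not a LineString object.

```python
import json


def load_lines(geojson_text):
    """Return a list of lines, each a list of (x, y) tuples, from a FeatureCollection."""
    collection = json.loads(geojson_text)
    lines = []
    for feature in collection.get("features", []):
        geometry = feature.get("geometry")
        # Only the LineString layout is handled in this sketch.
        if isinstance(geometry, dict) and geometry.get("type") == "LineString":
            lines.append([(float(pt[0]), float(pt[1])) for pt in geometry["coordinates"]])
    return lines


sample = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "LineString",
                "coordinates": [[0, 0], [10, 5], [20, 5]],
            },
        }
    ],
})

for line in load_lines(sample):
    print(len(line), "points, starting at", line[0])
```

For a file on disk, pass the result of `open(path, encoding="utf-8").read()` to the same function.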
Pre-training enables LineLM to learn the intrinsic structural patterns of vector line geometries in a self-supervised manner. This foundational understanding is needed before fine-tuning for specific tasks. The encoder is trained with a Masked Language Modeling (MLM) objective. Run the pre-training script to learn general line representations:
python pretrain_bert_large.py
After pre-training, fine-tune LineLM on a specific task, namely refining broken vector lines. The script (fine_tune_large.py) loads the pre-trained weights and trains the full encoder-decoder model on your paired dataset (noisy lines and their clean counterparts). Run the fine-tuning script for sequence-to-sequence learning:
python fine_tune_large.py
- Pre-training checkpoints are saved every 20 epochs in ./trained_weights/pretrain_trainset_large/
- Format: bert_pretrain_e{epoch}.pth
- Fine-tuning checkpoints are saved every 20 epochs in ./trained_weights/fine_tune_large/
- Format: LineLM_fine_tune_e{epoch}.pth
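Since checkpoint filenames encode the epoch, the most recent one can be found by parsing the name. The following is a small sketch; the `latest_checkpoint` helper is hypothetical and relies only on the `_e{epoch}.pth` naming shown above.

```python
import re

# Matches the "_e{epoch}.pth" suffix used by both checkpoint formats.
CHECKPOINT_PATTERN = re.compile(r"_e(\d+)\.pth$")


def latest_checkpoint(filenames):
    """Return (epoch, filename) for the highest epoch, or None if nothing matches."""
    found = []
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match:
            found.append((int(match.group(1)), name))
    return max(found) if found else None


names = [
    "LineLM_fine_tune_e20.pth",
    "LineLM_fine_tune_e100.pth",
    "LineLM_fine_tune_e40.pth",
]
print(latest_checkpoint(names))  # (100, 'LineLM_fine_tune_e100.pth')
```

To scan a real directory, pass `os.listdir("./trained_weights/fine_tune_large/")` as the argument. Epochs are compared as integers, so `e100` sorts after `e40`.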
| Model Type | Use Case | Download Link | Size |
|---|---|---|---|
| Pre-trained Encoder | Foundational model for custom fine-tuning. | Download | 642.7MB |
| Fine-tuned (Gaps) | For refining lines with many gaps and connectivity issues. | Download | 1.13GB |
| Fine-tuned (Distortion) | For refining lines with many branches and distorted geometry. | Download | 1.13GB |
Model Name: LineLM (Line Language Model)
Model Type: Transformer-based encoder-decoder for geospatial vector lines
Architecture: BERT encoder + GPT decoder
Training Data: Vector line geometries from geospatial datasets
Use Cases: Line refinement, gap filling, distortion correction, connectivity restoration
Input Format: Sequences of coordinate pairs
Output Format: Refined coordinate sequences
License: MIT
Notes:
- Maximum sequence length: 512 coordinate pairs
- Coordinate range: [0, 500]
- Performance may degrade on highly complex multi-line geometries
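The limits in the notes above can be checked before a line is passed to the model. This is a minimal sketch; the `check_line` helper is hypothetical, and it assumes the coordinate range [0, 500] is inclusive at both ends.

```python
MAX_POINTS = 512               # maximum sequence length, in coordinate pairs
COORD_MIN, COORD_MAX = 0, 500  # coordinate range, assumed inclusive


def check_line(coords):
    """Return a list of problems found; an empty list means the line fits the limits."""
    problems = []
    if len(coords) > MAX_POINTS:
        problems.append(f"{len(coords)} points exceeds the limit of {MAX_POINTS}")
    for i, (x, y) in enumerate(coords):
        if not (COORD_MIN <= x <= COORD_MAX and COORD_MIN <= y <= COORD_MAX):
            problems.append(f"point {i} ({x}, {y}) is outside [{COORD_MIN}, {COORD_MAX}]")
    return problems


print(check_line([(0, 0), (250, 499), (500, 500)]))  # []
print(check_line([(0, 0), (501, 3)]))
```

Lines that fail the check would need to be split or rescaled first; how the repository handles such cases is not specified here.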
The inference process is iterative, allowing LineLM to refine line geometries over multiple passes. Each iteration generates input data from the previous step's output, runs LineLM, and stitches the results back together at the map level.
The first step processes an initial input GeoJSON file, breaks it into patches, and runs the model on them.
python iterative_inference.py \
--iteration 0 \
--map_dir ./data/maps \
--in_geojson_dir ./data/inference_input_data \
--out_geojson_dir ./inference_output_data \
--in_geojson_name my_map_processed \
--map_name my_map \
--model_version 100 \
--cuda 0
- Output: This creates the directory ./inference_output_data/my_map_iter0/ containing the processed results, including my_map_post.geojson.
Subsequent iterations use the output of the previous step (_post.geojson) as the new input, allowing for progressive refinement.
python iterative_inference.py \
--iteration 1 \
--map_dir ./data/maps \
--extract_geojson_dir ./data/inference_input_data \
--out_geojson_dir ./inference_output_data \
--in_geojson_name my_map_processed \
--map_name my_map \
--model_version 100 \
--cuda 0
- How it works: For iteration 1, the script automatically looks for the output from iteration 0 (i.e., ./inference_output_data/my_map_iter0/my_map_post.geojson) to use as its input.
- --extract_geojson_dir: This is required for iterative steps to reference the original, unprocessed lines for context.
- --iteration: The current pass of the inference process (e.g., 0, 1, 2...).
- --map_dir: Directory containing the base map image files (.tif or .png).
- --in_geojson_dir: (Iteration 0 only) Directory containing the initial input GeoJSON file.
- --extract_geojson_dir: (Iteration > 0) Directory of the original unprocessed GeoJSON, used for context in refinement steps.
- --out_geojson_dir: The base directory where all output folders will be created (e.g., my_map_iter0, my_map_iter1).
- --in_geojson_name: The name of the input GeoJSON file (without extension).
- --map_name: The base name of the map, used for creating output directories and files.
- --model_version: The specific epoch/version of the fine-tuned model to use.
- --cuda: The ID of the CUDA device to use for GPU acceleration.
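Running several passes by hand means repeating nearly the same command, so a small driver can help. The sketch below only builds and prints the command lines, using the flags documented above. The `build_command` helper and its default paths are hypothetical, taken from the example commands.

```python
import shlex


def build_command(iteration, map_name, in_geojson_name, model_version, cuda=0,
                  map_dir="./data/maps",
                  input_dir="./data/inference_input_data",
                  out_dir="./inference_output_data"):
    """Build the argument list for one pass of iterative_inference.py."""
    # Iteration 0 reads the initial input; later passes reference the original lines.
    dir_flag = "--in_geojson_dir" if iteration == 0 else "--extract_geojson_dir"
    return [
        "python", "iterative_inference.py",
        "--iteration", str(iteration),
        "--map_dir", map_dir,
        dir_flag, input_dir,
        "--out_geojson_dir", out_dir,
        "--in_geojson_name", in_geojson_name,
        "--map_name", map_name,
        "--model_version", str(model_version),
        "--cuda", str(cuda),
    ]


for i in range(3):
    print(shlex.join(build_command(i, "my_map", "my_map_processed", 100)))
```

To execute the passes instead of printing them, each list can be handed to `subprocess.run(cmd, check=True)` in order, since every iteration depends on the previous one's output.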
You can download the GeoJSON files for inference from this link.
This project is licensed under the MIT License - see the LICENSE file for details.